
A Python-based web crawler that uses Playwright to extract links from web pages, starting from a given URL. It collects all links within the same base domain and saves them to a JSON file.


mahdizakery/py-link-crawler


Py Link Crawler

Py Link Crawler is a Python-based web crawler that uses Playwright to extract and filter links from web pages. The crawler starts from a given URL and collects all links within the same base domain, saving them to a JSON file and removing duplicates.

Requirements

  • Python 3.7+
  • Playwright

Installation

  1. Clone the repository:

    git clone https://github.com/mahdizakery/py-link-crawler.git
    cd py-link-crawler
  2. Install the required packages:

    pip install -r requirements.txt
  3. Install Playwright browsers:

    playwright install

Usage

  1. Update the start_url variable in link_crawler.py with the URL you want to start crawling from.

  2. Run the crawler:

    python link_crawler.py
  3. The collected links will be saved to all_links.json.
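The layout of all_links.json is not documented here. As a minimal sketch, assuming the file holds a flat JSON array of URL strings, saving and de-duplicating the links could look like the following. The function names mirror those listed under Functions, but the bodies are illustrative and are not taken from link_crawler.py.

```python
import json


def save_links_to_json(links, filename):
    """Write the links to filename as a JSON array (assumed file layout)."""
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(list(links), f, indent=2)


def remove_duplicates_from_json(filename):
    """Rewrite filename with repeated links dropped, keeping first-seen order."""
    with open(filename, encoding="utf-8") as f:
        links = json.load(f)
    unique = list(dict.fromkeys(links))  # dicts preserve insertion order
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(unique, f, indent=2)
    return unique
```

Using `dict.fromkeys` rather than `set` keeps the links in the order they were discovered, which makes the output easier to compare between runs.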

Functions

  • get_base_domain(url): Extracts the base domain from a URL.
  • get_all_links(url, base_domain): Retrieves all links from a page and filters them by the base domain.
  • find_all_pages(start_url): Crawls the web starting from the given URL and collects all links within the same base domain.
  • save_links_to_json(links, filename): Saves the collected links to a JSON file.
  • remove_duplicates_from_json(filename): Removes duplicate links from the JSON file.
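The crawl described by the functions above can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not the code in link_crawler.py: "base domain" is taken to mean the URL's host, and the names `filter_same_domain`, `crawl`, `fetch_hrefs_with_playwright`, and the `max_pages` limit are hypothetical. The page fetcher is passed in as a parameter so the traversal logic can be tried without launching a browser.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def get_base_domain(url):
    """Return the lowercased host of a URL (assumed meaning of 'base domain')."""
    return urlparse(url).netloc.lower()


def filter_same_domain(page_url, hrefs, base_domain):
    """Resolve hrefs against page_url and keep http(s) links on base_domain."""
    kept = []
    for href in hrefs:
        # Make the link absolute and drop any '#fragment' part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc.lower() == base_domain:
            kept.append(absolute)
    return kept


def crawl(start_url, fetch_hrefs, max_pages=100):
    """Breadth-first crawl that visits at most max_pages pages.

    fetch_hrefs(url) must return the raw href values found on that page.
    """
    base_domain = get_base_domain(start_url)
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        for link in filter_same_domain(url, fetch_hrefs(url), base_domain):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


def fetch_hrefs_with_playwright(url):
    """Hypothetical fetcher: load url in Chromium and return its href values.

    Launching a browser per page is slow; it keeps this sketch short.
    """
    from playwright.sync_api import sync_playwright  # imported only when used

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        hrefs = page.eval_on_selector_all(
            "a[href]", "els => els.map(e => e.getAttribute('href'))"
        )
        browser.close()
    return hrefs
```

With Playwright and its browsers installed, `crawl("https://example.com/", fetch_hrefs_with_playwright)` would run the traversal against a live site. Passing a plain function that returns canned href lists exercises the same logic offline.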

License

This project is licensed under the MIT License.
