# Py Link Crawler

Py Link Crawler is a Python-based web crawler that uses Playwright to extract and filter links from web pages. The crawler starts from a given URL, collects all links within the same base domain, saves them to a JSON file, and removes duplicates.
## Requirements

- Python 3.7+
- Playwright
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/mahdizakery/py-link-crawler.git
   cd py-link-crawler
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Install Playwright browsers:

   ```bash
   playwright install
   ```
## Usage

1. Update the `start_url` variable in `link_crawler.py` with the URL you want to start crawling from.

2. Run the crawler:

   ```bash
   python link_crawler.py
   ```

3. The collected links will be saved to `all_links.json`.
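The README does not show the output format. Assuming `save_links_to_json` writes a flat list of URLs (the domain below is a placeholder), `all_links.json` might look like:

```json
[
  "https://example.com/",
  "https://example.com/about",
  "https://example.com/contact"
]
```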
## Functions

- `get_base_domain(url)`: Extracts the base domain from a URL.
- `get_all_links(url, base_domain)`: Retrieves all links from a page and filters them by the base domain.
- `find_all_pages(start_url)`: Crawls the web starting from the given URL and collects all links within the same base domain.
- `save_links_to_json(links, filename)`: Saves the collected links to a JSON file.
- `remove_duplicates_from_json(filename)`: Removes duplicate links from the JSON file.
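The function bodies are not included here. A minimal sketch of the two pure helpers, assuming "base domain" means the URL's scheme plus host and that the JSON file holds a flat list of link strings, could look like:

```python
import json
from urllib.parse import urlparse


def get_base_domain(url):
    """Return the scheme and host of a URL, e.g. 'https://example.com'."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"


def remove_duplicates_from_json(filename):
    """Rewrite the JSON file with duplicate links removed, keeping first-seen order."""
    with open(filename) as f:
        links = json.load(f)
    # dict.fromkeys deduplicates while preserving insertion order (Python 3.7+)
    deduped = list(dict.fromkeys(links))
    with open(filename, "w") as f:
        json.dump(deduped, f, indent=2)
```

The `dict.fromkeys` trick relies on ordered dicts, which matches the project's stated Python 3.7+ requirement.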
## License

This project is licensed under the MIT License.