# Py Link Crawler

Py Link Crawler is a Python-based web crawler that uses Playwright to extract and filter links from web pages. The crawler starts from a given URL, collects all links within the same base domain, saves them to a JSON file, and removes duplicates.
## Requirements

- Python 3.7+
- Playwright
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/mahdizakery/py-link-crawler.git
   cd py-link-crawler
   ```

2. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Install Playwright browsers:

   ```bash
   playwright install
   ```
## Usage

1. Update the `start_url` variable in `link_crawler.py` with the URL you want to start crawling from.

2. Run the crawler:

   ```bash
   python link_crawler.py
   ```

3. The collected links will be saved to `all_links.json`.
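The README does not show the output format. Assuming `save_links_to_json` writes a flat list of URLs (the domain below is a placeholder), `all_links.json` might look like:

```json
[
  "https://example.com/",
  "https://example.com/about",
  "https://example.com/contact"
]
```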
## Functions

- `get_base_domain(url)`: Extracts the base domain from a URL.
- `get_all_links(url, base_domain)`: Retrieves all links from a page and filters them by the base domain.
- `find_all_pages(start_url)`: Crawls the web starting from the given URL and collects all links within the same base domain.
- `save_links_to_json(links, filename)`: Saves the collected links to a JSON file.
- `remove_duplicates_from_json(filename)`: Removes duplicate links from the JSON file.
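The function bodies are not included here. A minimal sketch of the two pure helpers, assuming "base domain" means the URL's scheme plus host and that the JSON file holds a flat list of link strings, could look like:

```python
import json
from urllib.parse import urlparse


def get_base_domain(url):
    """Return the scheme and host of a URL, e.g. 'https://example.com'."""
    parsed = urlparse(url)
    return f"{parsed.scheme}://{parsed.netloc}"


def remove_duplicates_from_json(filename):
    """Rewrite the JSON file with duplicate links removed, keeping first-seen order."""
    with open(filename) as f:
        links = json.load(f)
    # dict.fromkeys deduplicates while preserving insertion order (Python 3.7+)
    deduped = list(dict.fromkeys(links))
    with open(filename, "w") as f:
        json.dump(deduped, f, indent=2)
```

The `dict.fromkeys` trick relies on ordered dicts, which matches the project's stated Python 3.7+ requirement.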
## License

This project is licensed under the MIT License.