A web scraping project built with Scrapy, a fast, high-level web crawling and scraping framework for Python. It provides a customizable, scalable setup for extracting data from websites.

Features:
- Scalable and modular: Easily add spiders to scrape different websites.
- Efficient scraping: Leverages Scrapy's built-in performance optimizations.
- Custom pipelines: Process and save scraped data to different formats or databases.
- Middleware integration: Add request headers, handle proxies, or manage retries (see the sketch after this list).
- Configuration flexibility: Customize settings for each spider.
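As a quick illustration of the middleware hook, here is a minimal downloader middleware sketch. The class name and header are hypothetical examples, not code taken from this repository:

```python
# website_scraper/middlewares.py: hypothetical example middleware
class CustomHeadersMiddleware:
    """Attach a custom header to every outgoing request."""

    def process_request(self, request, spider):
        # Returning None tells Scrapy to continue processing this request.
        request.headers.setdefault('X-Example-Header', 'demo')
        return None
```

To activate it, list it under DOWNLOADER_MIDDLEWARES in settings.py, e.g. {'website_scraper.middlewares.CustomHeadersMiddleware': 543}.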
Below is the directory structure of the project:
```
website_scraper/
├── LICENSE
├── README.md
├── requirements.txt
├── scrapy.cfg
└── website_scraper/
    ├── __init__.py
    ├── __pycache__/
    │   ├── __init__.cpython-38.pyc
    │   ├── middlewares.cpython-38.pyc
    │   ├── pipelines.cpython-38.pyc
    │   └── settings.cpython-38.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── __pycache__/
        │   ├── __init__.cpython-38.pyc
        │   └── website_scraper_spider.cpython-38.pyc
        └── website_scraper_spider.py

5 directories, 17 files
```
- items.py: Define the data models for scraped items (sketched below).
- pipelines.py: Process and save scraped data, e.g., to a database, JSON, or CSV (sketched below).
- middlewares.py: Custom middleware for handling requests, responses, or errors.
- settings.py: Configure Scrapy settings such as user agents, delays, or pipelines.
- website_scraper_spider.py: Example spider for scraping data from a specific website.
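To make the relationship between items and pipelines concrete, here is a minimal sketch. The QuoteItem fields and the items.jl filename are illustrative assumptions, not the definitions actually used in this repository:

```python
# website_scraper/items.py: hypothetical item with two example fields
import scrapy


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
```

```python
# website_scraper/pipelines.py: a minimal pipeline writing items as JSON Lines
import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item; return the item so later pipelines see it.
        self.file.write(json.dumps(dict(item)) + '\n')
        return item
```

A pipeline only runs once it is enabled via ITEM_PIPELINES in settings.py (see Configuring Settings below).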
Prerequisites:
- Python 3.8 or higher
- pip (Python package manager)
Installation:
- Clone this repository:

```bash
git clone https://github.com/SciFrozen-Git/website-scraper.git
cd website-scraper
```

- Create a virtual environment:

```bash
python3 -m venv <venv>
```

- Activate the virtual environment (Linux/macOS):

```bash
source <venv>/bin/activate
```

- Activate the virtual environment (Windows):

```bash
<venv>\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Verify the installation:

```bash
scrapy version
```
Usage:
- Run a spider (the argument is the spider's `name` attribute, assumed here to be `website_scraper_spider`):

```bash
scrapy crawl website_scraper_spider
```

- Output data to a file (e.g., JSON):

```bash
scrapy crawl website_scraper_spider -o output.json
```
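Scrapy infers the export format from the file extension, so `-o output.csv` works the same way (`-o` appends; recent Scrapy versions also accept `-O` to overwrite). Exports can instead be configured in settings.py with the FEEDS setting (Scrapy 2.1+); the filename below is a hypothetical example:

```python
# settings.py: hypothetical feed export configuration
FEEDS = {
    'output.json': {'format': 'json'},
}
```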
Example:
Open the website_scraper/spiders/website_scraper_spider.py file and customize the start_urls list and the parse() method to scrape the data you need, as in the sketch below.
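A minimal sketch of what a customized spider might look like. The target site (quotes.toscrape.com) and the CSS selectors are illustrative assumptions, not the repository's actual code:

```python
# website_scraper/spiders/website_scraper_spider.py: illustrative sketch
import scrapy


class WebsiteScraperSpider(scrapy.Spider):
    name = 'website_scraper_spider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        # Yield one record per quote block using CSS selectors.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow pagination if a "next" link exists.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```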
Adding a New Spider:
- Create a new file in the spiders/ directory:

```bash
touch website_scraper/spiders/new_spider.py
```

- Define a new spider class:
```python
import scrapy


class NewSpider(scrapy.Spider):
    name = 'new_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
        # Add scraping logic here
        pass
```
- Run the new spider:

```bash
scrapy crawl new_spider
```
Configuring Settings:
Edit settings.py to customize:
- User-Agent: Set a custom User-Agent for your requests.
- Download delay: Add delays to avoid overloading websites.
- Pipelines: Enable specific pipelines to process scraped data (see the sketch below).
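A minimal sketch of the corresponding settings.py entries. The values are illustrative, not this repository's actual configuration, and JsonWriterPipeline refers to the hypothetical pipeline sketched earlier:

```python
# settings.py: illustrative values, adjust for your target site
USER_AGENT = 'website_scraper (+https://example.com/contact)'

# Wait between requests to avoid overloading the target server.
DOWNLOAD_DELAY = 1.0

# Enable pipelines; lower numbers run earlier (valid range 0-1000).
ITEM_PIPELINES = {
    'website_scraper.pipelines.JsonWriterPipeline': 300,
}
```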
Contributing:
Contributions are welcome! To contribute:
- Fork the repository to your own GitHub account.
- Clone the forked repository to your local machine:

```bash
git clone https://github.com/<your-username>/website-scraper.git
```

- Create a new branch for your feature/bugfix (make sure to branch off from the main branch):

```bash
git checkout -b <feature-name>
```

- Make your changes and stage them for commit:

```bash
git add .
```

- Commit your changes with a clear message:

```bash
git commit -m "Add <feature-name>"
```

- Push your branch to your fork:

```bash
git push origin <feature-name>
```

- Open a pull request on GitHub and provide a description of your changes.
This project is licensed under the MIT License.
If you like my work and want to show your support, you can buy me a coffee or make a donation! ☕
Send your donations to the following wallet address:
Bitcoin Address:
1JSHP87RKNg2okh1Bx7PrfdghHdQBrsBj1
Thanks for your support! 🙏