Skip to content

A powerful and easy-to-use tool built with Scrapy and Node.js that allows you to scrape and download the entire source code and assets of any website. Perfect for developers, researchers, and web enthusiasts who need offline access to websites or want to analyze their structure.

License

Notifications You must be signed in to change notification settings

SciFrozen-Git/website-scraper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Website Scraper

A web scraping project built using Scrapy, a fast, high-level web crawling and web scraping framework for Python. This project provides a customizable and scalable setup for scraping data from websites.


Table of Contents


Features

  • Scalable and modular: Easily add spiders to scrape different websites.
  • Efficient scraping: Leverages Scrapy's built-in performance optimization.
  • Custom pipelines: Process and save scraped data to different formats or databases.
  • Middleware integration: Add request headers, handle proxies, or manage retries.
  • Configuration flexibility: Customize settings for each spider.

Project Structure

Below is the directory structure of the project:

website_scraper/
├── LICENSE
├── README.md
├── requirements.txt
├── scrapy.cfg
└── website_scraper/
    ├── __init__.py
    ├── __pycache__/
    │   ├── __init__.cpython-38.pyc
    │   ├── middlewares.cpython-38.pyc
    │   ├── pipelines.cpython-38.pyc
    │   └── settings.cpython-38.pyc
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        ├── __pycache__/
        │   ├── __init__.cpython-38.pyc
        │   └── website_scraper_spider.cpython-38.pyc
        └── website_scraper_spider.py

5 directories, 17 files

Key Files:

  • items.py: Define the data models for scraped items.
  • pipelines.py: Process and save scraped data (e.g., save to a database, JSON, or CSV).
  • middlewares.py: Custom middleware for handling requests, responses, or errors.
  • settings.py: Configure Scrapy settings like user agents, delays, or pipelines.
  • website_scraper_spider.py: Example spider for scraping data from a specific website.

Installation

Prerequisites:

  • Python 3.8 or higher
  • Pip (Python package manager)

Steps:

  1. Clone this repository:
git clone https://github.com/SciFrozen-Git/website-scraper.git
cd website-scraper
  1. Create a virtual environment:
python3 -m venv <venv>
  • Activate Virtual Environment (On Linux/Mac)
source <venv>/bin/activate
  • Activate Virtual Environment (On Windows)
<venv>\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt
  1. Verify the installation:
scrapy version

Usage

  1. Run a spider:
scrapy crawl website_scraper
  1. Output data to a file (e.g., JSON):
scrapy crawl website_scraper_spider -o output.json

Example:

Open the website_scraper/spiders/website_scraper_spider.py file and customize the start_urls and parse() method to scrape the required data.


Customization

Adding a New Spider:

  1. Create a new file in the spiders/ directory:
touch website_scraper/spiders/new_spider.py
  1. Define a new spider class:
python
  import scrapy

  class NewSpider(scrapy.Spider):
    name = 'new_spider'
    start_urls = ['https://example.com']

    def parse(self, response):
      # Add scraping logic here
      pass
  1. Run the new spider:
  scrapy crawl new_spider

Configuring Settings:

Edit settings.py to customize:

  • User-Agent: Set a custom User-Agent for your requests.
  • Download Delay: Add delays to avoid overloading websites.
  • Pipelines: Enable specific pipelines to process scraped data.

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository to your own GitHub account.

  2. Clone the forked repository to your local machine:

git clone https://github.com/SciFrozen-Git/website-scraper.git
  1. Create a new branch for your feature/bugfix (make sure to branch off from the main branch):
git checkout -b <feature-name>
  1. Make your changes and stage them for commit:
git add .
  1. Commit your changes with a clear message:
git commit -m "Add <feature-name>"
  1. Push your changes to your branch:
git push origin <feature-name>
  1. Open a pull request on GitHub and provide a description of your changes.

License

This project is licensed under the MIT License.


Support the Project

If you like my work and want to show your support, you can buy me a coffee or make a donation! ☕

Send your donations to the following wallet address:

Bitcoin Address:

1JSHP87RKNg2okh1Bx7PrfdghHdQBrsBj1

Thanks for your support! 🙏

About

A powerful and easy-to-use tool built with Scrapy and Node.js that allows you to scrape and download the entire source code and assets of any website. Perfect for developers, researchers, and web enthusiasts who need offline access to websites or want to analyze their structure.

Topics

Resources

License

Stars

Watchers

Forks

Languages