A concurrent web worker written in Go (Golang) designed to crawl websites efficiently while respecting basic crawling policies. The worker stops automatically after crawling a specified number of links (default: 64).
- Concurrent crawling: Uses Go's goroutines for parallel processing of URLs.
- Kill switch: Automatically stops after crawling `n` links (configurable; see the sketch after this list).
- Duplicate URL prevention: Tracks visited URLs to avoid reprocessing.
- HTML parsing: Extracts links using the `goquery` library.
- Simple CLI: Easy to use with minimal configuration.
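The crawl loop itself is not reproduced in this guide, but for orientation here is a minimal sketch of how such a worker could be structured: one goroutine per page, a visited map for duplicate prevention, a hard cap of 64 links as the kill switch, and `goquery` for link extraction. The identifiers (`crawl`, `maxLinks`, `frontier`) are illustrative, not the project's actual names.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"

	"github.com/PuerkitoBio/goquery"
)

// crawl fetches one page and returns the href values of its links.
// Relative-URL resolution and politeness delays are omitted for brevity.
func crawl(pageURL string) ([]string, error) {
	resp, err := http.Get(pageURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var links []string
	doc.Find("a[href]").Each(func(_ int, s *goquery.Selection) {
		if href, ok := s.Attr("href"); ok {
			links = append(links, href)
		}
	})
	return links, nil
}

func main() {
	const maxLinks = 64 // kill switch: stop after this many pages

	visited := map[string]bool{}                          // duplicate-URL prevention
	frontier := []string{"https://prorobot.ai/hashtags"}  // seed URL

	for len(frontier) > 0 && len(visited) < maxLinks {
		var (
			mu   sync.Mutex
			wg   sync.WaitGroup
			next []string
		)
		for _, u := range frontier {
			if visited[u] || len(visited) >= maxLinks {
				continue
			}
			visited[u] = true

			wg.Add(1)
			go func(u string) { // one goroutine per page in the current level
				defer wg.Done()
				links, err := crawl(u)
				if err != nil {
					return // skip pages that fail to load or parse
				}
				mu.Lock()
				next = append(next, links...)
				mu.Unlock()
			}(u)
		}
		wg.Wait()
		frontier = next
	}

	fmt.Printf("crawled %d pages\n", len(visited))
}
```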
Below are step-by-step instructions to test the web worker application.
- Go: Ensure Go is installed on your system.
- Dependencies: Install the required Go packages.
go get github.com/PuerkitoBio/goquery
go get github.com/mattn/go-sqlite3
Start the server by running the following command in the terminal:
go run main.go
The server will start on http://localhost:8080.
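main.go itself is not reproduced in this guide; if you want a mental model of the routing, a setup along the following lines with the standard library would expose the same four endpoints. The handler names are assumptions for illustration only.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

// Stub handlers; possible implementations are sketched in later sections.
func handleCrawl(w http.ResponseWriter, r *http.Request)    {}
func handleListJobs(w http.ResponseWriter, r *http.Request) {}
func handleStatus(w http.ResponseWriter, r *http.Request)   {}
func handleResults(w http.ResponseWriter, r *http.Request)  {}

func main() {
	mux := http.NewServeMux()

	// POST /crawl starts a new crawling job.
	mux.HandleFunc("/crawl", handleCrawl)

	// GET /jobs lists all jobs; /jobs/{id}/status and /jobs/{id}/results
	// are dispatched by path suffix.
	mux.HandleFunc("/jobs", handleListJobs)
	mux.HandleFunc("/jobs/", func(w http.ResponseWriter, r *http.Request) {
		switch {
		case strings.HasSuffix(r.URL.Path, "/status"):
			handleStatus(w, r)
		case strings.HasSuffix(r.URL.Path, "/results"):
			handleResults(w, r)
		default:
			http.NotFound(w, r)
		}
	})

	log.Println("listening on http://localhost:8080")
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```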
Send a POST request to start a new crawling job:
curl -X POST http://localhost:8080/crawl \
-H "Content-Type: application/json" \
-d '{"url":"https://prorobot.ai/hashtags"}'
Response:
{"job_id": "1623751234567890000"}
- Save the `job_id` for further testing.
Use the `job_id` to check the status of a crawling job:
curl http://localhost:8080/jobs/{job_id}/status
Replace `{job_id}` with the actual job ID.
Example:
curl http://localhost:8080/jobs/1623751234567890000/status
Response:
{"job_id": "1623751234567890000", "status": "running", "processed": 15, "total": 64}
Retrieve a list of all jobs (both active and completed):
curl http://localhost:8080/jobs
Response:
[
{"job_id": "1623751234567890000", "status": "running", "processed": 15, "total": 64},
{"job_id": "1623751234567890001", "status": "completed", "processed": 64, "total": 64}
]
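On the server side, listing jobs is typically just a snapshot of a mutex-protected registry encoded as JSON. A sketch, with an assumed `jobSummary` type and `registry` map rather than the project's real types:

```go
package main

import (
	"encoding/json"
	"net/http"
	"sync"
)

// jobSummary mirrors one entry in the GET /jobs response.
type jobSummary struct {
	JobID     string `json:"job_id"`
	Status    string `json:"status"`
	Processed int    `json:"processed"`
	Total     int    `json:"total"`
}

var (
	registryMu sync.RWMutex
	registry   = map[string]*jobSummary{}
)

// handleListJobs returns every known job, running or completed.
func handleListJobs(w http.ResponseWriter, r *http.Request) {
	registryMu.RLock()
	list := make([]*jobSummary, 0, len(registry))
	for _, j := range registry {
		list = append(list, j)
	}
	registryMu.RUnlock()

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(list)
}

func main() {
	http.HandleFunc("/jobs", handleListJobs)
	http.ListenAndServe(":8080", nil)
}
```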
Retrieve the results of a completed job:
curl http://localhost:8080/jobs/{job_id}/results
Replace `{job_id}` with the actual job ID.
Example:
curl http://localhost:8080/jobs/1623751234567890000/results
Response:
[
{"url": "https://prorobot.ai/hashtags", "title": "Example Page", "content": "Lorem ipsum..."},
...
]
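To work with the results programmatically, decode them into a struct whose fields mirror the example above. For instance, this sketch prints each crawled URL and title; the `crawledPage` type is made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// crawledPage mirrors one entry in the /results response.
type crawledPage struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}

func main() {
	const jobID = "1623751234567890000" // replace with a real job ID

	resp, err := http.Get("http://localhost:8080/jobs/" + jobID + "/results")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	var pages []crawledPage
	if err := json.NewDecoder(resp.Body).Decode(&pages); err != nil {
		fmt.Println("decode failed:", err)
		return
	}

	for _, p := range pages {
		fmt.Printf("%s -> %s\n", p.URL, p.Title)
	}
}
```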
- Starting a Job:
  - A new job is created, and a `job_id` is returned.
  - The job begins crawling the provided URL.
- Checking Job Status:
  - If the job is running, the status will be `"running"` with the number of processed links.
  - If the job is completed, the status will be `"completed"`.
- Listing All Jobs:
  - Returns a list of all jobs with their `job_id`, `status`, `processed`, and `total` links.
- Retrieving Job Results:
  - If the job is completed, returns the crawled data (URL, title, and content).
  - If the job is still running, returns a `"processing"` status.
- Concurrency: Multiple jobs can run simultaneously. Each job is tracked independently.
- Error Handling: If a job ID is invalid or not found, the API will return a `404 Not Found` error.
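One straightforward way to get the 404 behaviour is a shared lookup helper that every per-job handler calls before doing any work. The sketch below assumes a simple in-memory `jobs` registry; helper names are illustrative.

```go
package main

import (
	"net/http"
	"strings"
	"sync"
)

var (
	jobsMu sync.RWMutex
	jobs   = map[string]string{} // job_id -> status (illustrative)
)

// jobIDFromPath extracts {job_id} from /jobs/{job_id}/status or .../results.
func jobIDFromPath(path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) >= 2 && parts[0] == "jobs" {
		return parts[1]
	}
	return ""
}

// requireJob writes 404 Not Found when the job ID is missing or unknown.
func requireJob(w http.ResponseWriter, r *http.Request) (string, bool) {
	id := jobIDFromPath(r.URL.Path)

	jobsMu.RLock()
	_, ok := jobs[id]
	jobsMu.RUnlock()

	if id == "" || !ok {
		http.Error(w, "job not found", http.StatusNotFound)
		return "", false
	}
	return id, true
}

func main() {
	http.HandleFunc("/jobs/", func(w http.ResponseWriter, r *http.Request) {
		id, ok := requireJob(w, r)
		if !ok {
			return
		}
		_ = id // ... dispatch to the status/results handlers here
	})
	http.ListenAndServe(":8080", nil)
}
```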
- Start a new job:
curl -X POST http://localhost:8080/crawl -H "Content-Type: application/json" -d '{"url":"https://prorobot.ai/hashtags"}'
- Check the job status:
curl http://localhost:8080/jobs/1623751234567890000/status
- List all jobs:
curl http://localhost:8080/jobs
- Retrieve results after the job completes:
curl http://localhost:8080/jobs/1623751234567890000/results
This testing guide lets you verify all of the web worker application's functionality.