feat: William: Google image scraper #312

Open
wants to merge 3 commits into
base: master

@Willi8910 Willi8910 commented Mar 19, 2025

Hi, I have just completed the task.

Approach
My approach is to scrape the information with Selenium, an external dependency. In short, it is web automation that opens a real web browser and captures the information directly from the rendered page. One challenge is Google's CAPTCHA, which limits access from bots and automation, but with customized browser settings Google treats the automated browser as a normal user.
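
To illustrate what "customized browser settings" can look like, here is a minimal sketch. The PR text does not list the settings it actually uses, so the flag list below is an assumption, and the helper names chrome_arguments and build_driver are hypothetical. These Chrome flags are commonly used to make an automated session look closer to a regular one; the sketch does not verify that they are enough to avoid Google's CAPTCHA.

```ruby
# Hypothetical flag list: an assumption for illustration, not the PR's actual settings.
def chrome_arguments(user_agent: nil)
  args = [
    "--disable-blink-features=AutomationControlled", # hide one common automation signal
    "--window-size=1366,768",                        # a typical desktop window size
    "--lang=en-US"
  ]
  args << "--user-agent=#{user_agent}" if user_agent
  args
end

# Builds a Chrome driver from the flags above (hypothetical helper).
def build_driver(user_agent: nil)
  # Required lazily so the flag list can be inspected without the gem installed.
  require "selenium-webdriver"

  options = Selenium::WebDriver::Chrome::Options.new
  chrome_arguments(user_agent: user_agent).each { |arg| options.add_argument(arg) }
  Selenium::WebDriver.for(:chrome, options: options)
end
```

Keeping the flags in their own method makes them easy to adjust and to check in a unit test without launching a browser.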

Setup
Since the task has to be done in this repo and there is no requirement to use a framework, I wrote it in plain Ruby. Other dependencies are loaded with require.

I also separated the execution function from the main scraper logic to make the code more maintainable and easier to read.
The entry point is start_scraper.rb, which requires a query argument. Here is an example call:

ruby start_scraper.rb query="Van Gogh Painting"
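
As a rough sketch of how an entry point could read that argument: the shell strips the quotes, so the script receives the single string query=Van Gogh Painting in ARGV. The method name parse_query and the error messages are hypothetical; the actual start_scraper.rb may handle this differently.

```ruby
# Extracts the search terms from a `query=...` argument (hypothetical helper).
def parse_query(argv)
  pair = argv.find { |arg| arg.start_with?("query=") }
  raise ArgumentError, 'usage: ruby start_scraper.rb query="<search terms>"' if pair.nil?

  query = pair.delete_prefix("query=").strip
  raise ArgumentError, "query must not be empty" if query.empty?

  query
end
```

In the entry point this would be called as parse_query(ARGV), with the result handed to the scraper.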

Main Logic
The requirements mention that we can scrape image results other than paintings. Those results appear in the 'Image' box, which is somewhat different from the 'artwork' box, so I split the logic into separate functions that scrape each box independently. If a page has both, we fetch both.

Results go into the result directory, inside a subdirectory named after the search query.
That directory contains 3 files: the expected array, the page source, and a screenshot. The page source and screenshot are the same as in the example. In the expected array, however, image results and artworks are stored in separate JSON fields.
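
A minimal sketch of that output step follows. The file name expected-array.json, the JSON keys "artworks" and "images", the directory-name normalization, and the method name write_expected_array are all assumptions made for illustration; the PR may use different names.

```ruby
require "json"
require "fileutils"

# Writes artworks and image results into separate JSON fields under
# <base_dir>/<query-based directory>/ and returns the file path (hypothetical helper).
def write_expected_array(query, artworks:, images:, base_dir: "result")
  # Assumed normalization: lowercase, non-alphanumerics collapsed to dashes.
  dir_name = query.strip.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
  dir = File.join(base_dir, dir_name)
  FileUtils.mkdir_p(dir)

  path = File.join(dir, "expected-array.json")
  File.write(path, JSON.pretty_generate("artworks" => artworks, "images" => images))
  path
end
```

Keeping the two result types under separate keys lets a consumer read either list without filtering a mixed array.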

Testing
I use RSpec for testing. The Google scraper test is an integration test: your example result also appears to be tested that way, by fetching real data from the website and comparing the results against it, so I followed the same convention.
I also added a small test for the execution function to increase coverage.
