feat: William: Google image scraper #312
Open
Hi, I just completed the task.
Approach
My approach uses Selenium, an external dependency, to scrape the information. In short, it is web automation that opens a real browser and captures the information directly from the rendered page. One challenge is Google's CAPTCHA, which limits access from bots and automation, but with customized browser settings Google treats the automated session as a normal user.
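As a rough illustration of this approach (not the code in this PR), the sketch below starts Chrome through the selenium-webdriver gem with a few custom arguments and reads back the rendered page. The helper names `chrome_arguments`, `build_driver`, and `capture_page` are hypothetical, and the specific flags are assumptions, since the settings actually used are not listed in this description.

```ruby
# Hypothetical Chrome flags; which settings the PR really uses is an assumption.
def chrome_arguments
  [
    "--window-size=1280,900",
    "--lang=en-US",
    "--disable-blink-features=AutomationControlled"
  ]
end

# Builds a Chrome driver with the arguments above.
# The gem is required lazily so the helpers can be loaded without it.
def build_driver
  require "selenium-webdriver"

  options = Selenium::WebDriver::Chrome::Options.new
  chrome_arguments.each { |arg| options.add_argument(arg) }
  Selenium::WebDriver.for(:chrome, options: options)
end

# Navigates to the given URL and returns the rendered HTML.
def capture_page(driver, url)
  driver.get(url)
  driver.page_source
end
```

A caller would create the driver with `build_driver`, pass it to `capture_page` along with a search URL, and call `driver.quit` when finished so the browser process is closed.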
Setup
Since the task has to be done in this repo and there is no requirement to use a framework, I wrote it in pure Ruby and load the other dependencies with require.
I also separated the execution script from the main scraper logic to make the code more maintainable and easier to read.
The entry point is start_scraper.rb, which requires a query argument. Here is an example of how to call it:
ruby start_scraper.rb query="Van Gogh Painting"
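For illustration only, here is one way an entry point could read a `query=...` argument like the one above. `parse_query` is a hypothetical helper, not necessarily what start_scraper.rb does.

```ruby
# Hypothetical helper: extracts the value of a `query=...` command-line argument.
# The shell strips the quotes, so ARGV holds a single string such as
# "query=Van Gogh Painting".
def parse_query(argv)
  pair = argv.find { |arg| arg.start_with?("query=") }
  raise ArgumentError, 'missing required argument: query="..."' if pair.nil?

  query = pair.delete_prefix("query=").strip
  raise ArgumentError, "query must not be empty" if query.empty?

  query
end
```

Calling `parse_query(ARGV)` at the top of the script would fail fast with a clear message when the argument is missing, before any browser is started.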
Main Logic
The requirements mention that we can scrape image results other than paintings. Those results appear in the 'Image' box, which is somewhat different from the 'artwork' box, so I split the logic into separate functions that scrape each box independently. If a page has both, we fetch both.
Results go into the result directory, in a subdirectory named after the search query. Each subdirectory contains three files: the expected array, the page source, and a screenshot. The page source and screenshot follow the same format as in the example. In the expected array, however, image results and artwork results are stored in separate JSON fields.
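A minimal sketch of that layout, under stated assumptions: the file name `expected_array.json`, the field names `artworks` and `images`, and the helper `write_results` are all hypothetical and may differ from what this PR writes.

```ruby
require "json"
require "fileutils"

# Hypothetical helper: writes scraped results to <base_dir>/<query>/expected_array.json,
# keeping artwork results and image results in separate JSON fields.
# A field is omitted when its box was not present on the page.
# The query is used as the directory name as-is; sanitizing it is left out of this sketch.
def write_results(query, artworks:, images:, base_dir: "result")
  dir = File.join(base_dir, query)
  FileUtils.mkdir_p(dir)

  payload = {}
  payload["artworks"] = artworks unless artworks.empty?
  payload["images"] = images unless images.empty?

  path = File.join(dir, "expected_array.json")
  File.write(path, JSON.pretty_generate(payload))
  path
end
```

Keeping the two result types under separate keys lets a consumer tell which box each entry came from without inspecting the entries themselves.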
Testing
I use RSpec for the tests. The Google scraper test is an integration test: your example result also appears to be verified that way, by fetching real data from the website and comparing the results against it, so I followed the same convention.
I also added a small test for the execution script to increase coverage.