feat(scrapers): Van Gogh scraper #311

Open
wants to merge 3 commits into
base: master
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -49,3 +49,5 @@ build-iPhoneSimulator/
# unless supporting rvm < 1.11.0 or doing something fancy, ignore this:
.rvmrc
.DS_Store

.byebug_history
3 changes: 3 additions & 0 deletions .rspec
@@ -0,0 +1,3 @@
--color
--format documentation
--require spec_helper
8 changes: 8 additions & 0 deletions Gemfile
@@ -0,0 +1,8 @@
# frozen_string_literal: true

source "https://rubygems.org"

# gem "rails"
gem 'nokolexbor', '~> 0.6.0'
gem 'byebug'
gem 'rspec'
36 changes: 36 additions & 0 deletions Gemfile.lock
@@ -0,0 +1,36 @@
GEM
remote: https://rubygems.org/
specs:
byebug (11.1.3)
diff-lcs (1.6.0)
nokolexbor (0.6.0)
nokolexbor (0.6.0-arm64-darwin)
nokolexbor (0.6.0-x86_64-darwin)
nokolexbor (0.6.0-x86_64-linux)
rspec (3.13.0)
rspec-core (~> 3.13.0)
rspec-expectations (~> 3.13.0)
rspec-mocks (~> 3.13.0)
rspec-core (3.13.3)
rspec-support (~> 3.13.0)
rspec-expectations (3.13.3)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-mocks (3.13.2)
diff-lcs (>= 1.2.0, < 2.0)
rspec-support (~> 3.13.0)
rspec-support (3.13.2)

PLATFORMS
arm64-darwin
ruby
x86_64-darwin
x86_64-linux

DEPENDENCIES
byebug
nokolexbor (~> 0.6.0)
rspec

BUNDLED WITH
2.6.3
16 changes: 16 additions & 0 deletions lib/scrapers/generic.rb
@@ -0,0 +1,16 @@
module Scrapers
class Generic
attr_accessor :selector, :processor

DEFAULT_PROCESSOR_FN = ->(item) { item.text }

def initialize(selector:, processor: DEFAULT_PROCESSOR_FN)
@selector = selector
@processor = processor
end

def scrape(html)
@processor.call(html.css(@selector))
end
end
end
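
A minimal sketch of how `Generic` pairs a selector with a processor. The class body is copied from the diff above so the snippet runs on its own. `FakeNode` and `FakeMatch` are hypothetical stand-ins for a parsed node and its match set, and the sample titles and years are made up; a real run would pass a Nokolexbor or Nokogiri node instead.

```ruby
module Scrapers
  # Copy of the class under review, so this sketch is self-contained.
  class Generic
    DEFAULT_PROCESSOR_FN = ->(item) { item.text }

    def initialize(selector:, processor: DEFAULT_PROCESSOR_FN)
      @selector = selector
      @processor = processor
    end

    def scrape(html)
      @processor.call(html.css(@selector))
    end
  end
end

# Hypothetical stand-in for whatever `css` returns: only `text` is needed here.
FakeMatch = Struct.new(:text)

# Hypothetical stand-in for a parsed node: answers `css` from a fixed table.
class FakeNode
  def initialize(matches)
    @matches = matches
  end

  def css(selector)
    @matches.fetch(selector) { FakeMatch.new('') }
  end
end

name_scraper = Scrapers::Generic.new(selector: 'div.pgNMRc')
extensions_scraper = Scrapers::Generic.new(
  selector: 'div.cxzHyb',
  processor: ->(div) { div.text.empty? ? nil : [div.text] }
)

with_year = FakeNode.new(
  'div.pgNMRc' => FakeMatch.new('The Starry Night'),
  'div.cxzHyb' => FakeMatch.new('1889')
)
without_year = FakeNode.new('div.pgNMRc' => FakeMatch.new('Irises'))

p name_scraper.scrape(with_year)
p extensions_scraper.scrape(with_year)
p extensions_scraper.scrape(without_year)
```

Returning `nil` from a processor is what lets `Gallery#scrape` leave a key out of the output item.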
45 changes: 45 additions & 0 deletions lib/scrapers/google/gallery.rb
@@ -0,0 +1,45 @@
require_relative 'image'
require_relative 'image_replacer_script'
require_relative '../generic'

module Scrapers
module Google
class Gallery
DEFAULT_SELECTOR = 'div.iELo6'.freeze
DEFAULT_SCRAPERS = {
name: Scrapers::Generic.new(selector: 'div.pgNMRc'),
extensions: Scrapers::Generic.new(selector: 'div.cxzHyb', processor: ->(div) { div.text.empty? ? nil : [div.text] }),
link: Scrapers::Generic.new(selector: 'a', processor: ->(links) { href = links[0]&.attributes&.dig('href')&.text; href && "https://www.google.com#{href}" }),
image: Scrapers::Google::Image.new
}
DEFAULT_SCRIPT_SCRAPER = Scrapers::Google::ImageReplacerScript

def initialize(parser:, selector: DEFAULT_SELECTOR, scrapers: DEFAULT_SCRAPERS, script_scraper: DEFAULT_SCRIPT_SCRAPER)
@parser = parser
@selector = selector
@scrapers = scrapers
@script_scraper = script_scraper.is_a?(Class) ? script_scraper.new : script_scraper
end

def scrape(input)
html = @parser.HTML(input)

output = []
@scrapers[:image]&.image_map = @script_scraper.scrape(html)

html.css(@selector).each do |item|
output_item = {}

@scrapers.each_pair do |key, scraper|
value = scraper.scrape(item)
output_item[key] = value if value
end

output << output_item
end

output
end
end
end
end
33 changes: 33 additions & 0 deletions lib/scrapers/google/image.rb
@@ -0,0 +1,33 @@
module Scrapers
module Google
class Image
DEFAULT_SELECTOR = 'img.taFZJe'.freeze
PLACEHOLDER = 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=='.freeze

attr_accessor :selector, :image_map

def initialize(selector: DEFAULT_SELECTOR)
@selector = selector
end

def scrape(html)
images = html.css(@selector)
return if images.empty?
image = images[0]

image_src = image.attributes['src']&.text
image_url = image_src
image_data_src = image.attributes['data-src']&.text

if image_data_src
image_url = image_data_src
elsif image_src == PLACEHOLDER
image_id = image.attributes['id']&.text
image_url = image_map[image_id] if image_map&.key?(image_id)
end

image_url
end
end
end
end
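
The resolution order in `Image#scrape` may be easier to review as a plain function. This is a sketch only: `resolve_image_url` is a hypothetical helper that takes strings rather than a parsed `<img>` node, and the example URLs and ids are made up.

```ruby
# Same 1x1 GIF placeholder as Scrapers::Google::Image::PLACEHOLDER.
PLACEHOLDER = 'data:image/gif;base64,R0lGODlhAQABAIAAAP///////yH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=='

# Hypothetical helper mirroring the branch order of Image#scrape:
# 1. a data-src attribute wins,
# 2. a placeholder src is swapped for the script-provided data when the id is known,
# 3. otherwise the src is returned unchanged.
def resolve_image_url(src:, data_src: nil, id: nil, image_map: {})
  return data_src if data_src
  return image_map.fetch(id, src) if src == PLACEHOLDER

  src
end

image_map = { 'img_1' => 'data:image/jpeg;base64,AAAA==' }

p resolve_image_url(src: 'https://example.com/a.jpg')
p resolve_image_url(src: PLACEHOLDER, data_src: 'https://example.com/b.jpg')
p resolve_image_url(src: PLACEHOLDER, id: 'img_1', image_map: image_map)
p resolve_image_url(src: PLACEHOLDER, id: 'unknown', image_map: image_map)
```

Note that an unknown id falls through to the placeholder itself, which matches the behavior of the class in this diff.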
49 changes: 49 additions & 0 deletions lib/scrapers/google/image_replacer_script.rb
@@ -0,0 +1,49 @@
module Scrapers
module Google
class ImageReplacerScript
DEFAULT_SELECTOR = "script".freeze
IMAGE_REPLACER_FN = "_setImagesSrc".freeze
IMAGE_DATA_PATTERN = /s='(.*?)';.*?var ii=\['(.*?)'\]/

attr_accessor :selector, :image_replacer_fn

def initialize(selector: DEFAULT_SELECTOR, image_replacer_fn: IMAGE_REPLACER_FN)
@selector = selector
@image_replacer_fn = image_replacer_fn
end

def scrape(html)
return {} if html.nil?

scrape_image_map(scrape_image_replacer_script(html))
end

private

def scrape_image_replacer_script(html)
html.css(@selector)
.select { |script| script.text.include?(@image_replacer_fn) }
.map(&:text)
.join
end

def scrape_image_map(script)
return {} if script.empty?

matches = script.scan(IMAGE_DATA_PATTERN)
return {} if matches.empty?

image_map = {}
matches.each do |base64, image_id|
image_map[image_id] = sanitize(base64)
end

image_map
end

def sanitize(base64)
base64.gsub(/\\x3d/, "=")
end
end
end
end
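
For reviewers, a small sketch of what `IMAGE_DATA_PATTERN` and `sanitize` extract. `extract_image_map` is a hypothetical standalone helper mirroring `scrape_image_map`, and `SAMPLE_SCRIPT` is a made-up string shaped to fit the pattern, not real page markup.

```ruby
# Same pattern as ImageReplacerScript::IMAGE_DATA_PATTERN.
IMAGE_DATA_PATTERN = /s='(.*?)';.*?var ii=\['(.*?)'\]/

# Hypothetical helper: builds { image_id => data URI } and restores
# the "=" padding that appears escaped as a literal \x3d in the script text.
def extract_image_map(script)
  script.scan(IMAGE_DATA_PATTERN).each_with_object({}) do |(base64, image_id), map|
    map[image_id] = base64.gsub(/\\x3d/, '=')
  end
end

# Made-up sample; "\\x3d" here is a literal backslash followed by x3d.
SAMPLE_SCRIPT = "(function(){var s='data:image/jpeg;base64,AAAA\\x3d\\x3d';" \
                "var ii=['img_1'];_setImagesSrc(ii,s);})();"

p extract_image_map(SAMPLE_SCRIPT)
p extract_image_map('console.log("no images here")')
```

Because `.` does not match newlines without the `m` flag, this pattern assumes each `s=...` / `var ii=...` pair sits on a single line of script text.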
25 changes: 25 additions & 0 deletions scrape.rb
@@ -0,0 +1,25 @@
require 'json'

require_relative 'lib/scrapers/google/gallery'

input_file = ARGV[0] || './files/van-gogh-paintings.html'
parser_name = ARGV[1] || 'nokolexbor'

# Load only the requested parser; nokogiri is not in the Gemfile and must be installed separately.
html_parser =
case parser_name
when 'nokogiri'
require 'nokogiri'
Nokogiri
else
require 'nokolexbor'
Nokolexbor
end

scraper = Scrapers::Google::Gallery.new(parser: html_parser)

html = File.read(input_file)
artworks = scraper.scrape(html)

output = {artworks: artworks}
puts output.to_json