Background
RecipeRadar is a recipe search engine that provides ingredient-based search across recipes sourced from around the World Wide Web.
Currently the system uses the recipe-scrapers Python library to parse recipes from webpage HTML and to produce a set of relevant metadata fields, which are then processed and used to build a search engine index.
We (attempt to) contribute fixes and improvements to recipe-scrapers because we stand to benefit from improved coverage, accuracy and performance when the library is correct, well-tested and optimized.
Challenges
Continuous testing against live and/or near-live data is currently outside the scope of the recipe-scrapers library. We would like to provide ongoing healthcheck-style testing against recently-sourced webpages to identify opportunities for improvement.
TARDIR Goals
The intended consumer of tardir data is a dashboard that provides a colour-coded Gantt-style chart covering each supported website.
The passage of time should be represented horizontally, and each website host should be represented by a row-oriented bar (or sparkline) that allows rapid visual identification of changes over time.
A precise algorithm to determine the display colour from a point-in-time onwards is TBD, but it might, for example, specify that 100% success is represented in turquoise and 100% scraper error in red-orange (ref). Site outages/unavailability should be represented differently -- perhaps by grey, to indicate that support has been implemented but that data is unavailable.
See grafana-gantt-panel to get an idea of the expected output format.
tardir is responsible for assembling the datapoints to make this representation possible.
The representation is intended to guide our team towards improved coverage and accuracy for our users, and towards reduced resource costs (time, memory, and energy) in providing the infrastructure.
Rows in the chart should be sorted by health score, with the least healthy rows presented at the top (first-in-line for maintenance attention).
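Since the precise algorithm is TBD, here is a minimal Python sketch of one possible colour-mapping and row-ordering scheme. The linear colour blend, the specific colour values, and the HostSample structure are illustrative assumptions rather than settled design:

```python
from dataclasses import dataclass

@dataclass
class HostSample:
    """Per-host counters for one point in time (hypothetical structure)."""
    host: str
    retrieved: int  # successful HTTP retrievals
    success: int    # complete parses
    error: int      # scraper exceptions

TURQUOISE = (0x40, 0xE0, 0xD0)   # 100% parse success
RED_ORANGE = (0xFF, 0x45, 0x00)  # 100% scraper error
GREY = "#808080"                 # outage: support implemented, data unavailable

def cell_colour(sample: HostSample) -> str:
    """Pick a display colour for one host at one point in time."""
    if sample.retrieved == 0:
        return GREY  # site unavailable: nothing to judge the scraper on
    # Linear blend between the two extremes by success ratio (assumption).
    ratio = sample.success / sample.retrieved
    rgb = (round(lo + (hi - lo) * ratio) for lo, hi in zip(RED_ORANGE, TURQUOISE))
    return "#" + "".join(f"{channel:02x}" for channel in rgb)

def health_score(sample: HostSample) -> float:
    """Fraction of retrieved pages parsed completely; lower is less healthy."""
    return sample.success / sample.retrieved if sample.retrieved else 0.0

samples = [
    HostSample("example-recipes.test", retrieved=100, success=97, error=3),
    HostSample("broken-site.test", retrieved=100, success=10, error=90),
    HostSample("offline-site.test", retrieved=0, success=0, error=0),
]
# Least healthy first: first in line for maintenance attention.
for sample in sorted(samples, key=health_score):
    print(sample.host, cell_colour(sample))
```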
⚠️ In some cases, fixes may not be possible within the recipe-scrapers library, and upstream modifications will be required. It seems difficult to ensure that problems are solved (and solved adequately) while also preventing a small subset of sites from monopolizing maintainer time due to site errors.
Alignment of incentives -- perhaps with multiple alignment paths depending on site culture/psychology -- may help address this.
Resources
The CommonCrawl project provides a regularly-updated collection of pages crawled from the World Wide Web, organized according to various indexing schemes.
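For illustration, here is a short sketch of listing captures for a host via CommonCrawl's public CDX index API. The crawl label (CC-MAIN-2024-10) is an example; the right crawl would be chosen to match the target time window:

```python
import json
import requests

# Each crawl has its own index endpoint at index.commoncrawl.org.
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

response = requests.get(
    CDX_API,
    params={"url": "example.com/*", "output": "json", "limit": 5},
    timeout=30,
)
response.raise_for_status()
# The API returns one JSON object per line.
for line in response.text.splitlines():
    record = json.loads(line)
    print(record["timestamp"], record["url"], record.get("status"))
```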
Workflow
The tardir workflow accepts two parameters as input (a sketch covering both stages appears after this list):
A time T, used as an upper-bound on the content (CommonCrawl) to retrieve
A time C, used as an upper-bound on the parser (recipe-scrapers) version to use
Retrieve -- and where not available, collect the most-recent-known version of -- a sample of at most 100 pages per host at time T, served by named hosts supported by recipe-scrapers at time C
HTTP redirect codes to the same host should be followed
HTTP redirect codes to different hosts should be counted (content.T.host.migrated)
HTTP errors should be counted as failures against the host (content.T.host.failed)
Successful HTTP retrievals should be counted (content.T.host.retrieved)
Content size should be recorded (content.T.host.size)
The recipe-scrapers library at time C should be used to extract all recipe-related fields from the at most 100 pages sampled per host during the retrieval stage
Errors/exceptions should be counted (parse.T.host.error and parse.T.C.host.error)
Empty/none values should be counted (parse.T.host.partial and parse.T.C.host.partial)
Complete success should be counted (parse.T.host.success and parse.T.C.host.success)
Time spent parsing the recipes should be recorded (parse.T.host.duration and parse.T.C.host.duration)
Memory used parsing the recipes should be recorded (parse.T.host.memory and parse.T.C.host.memory)
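To make the two stages concrete, here is a minimal sketch of the parse stage under stated assumptions: pages have already been sampled and fetched (so the redirect/migration/failure accounting from the retrieval stage is omitted), recipe-scrapers' scrape_html entry point is available in the version pinned at time C, and a simple in-process Counter stands in for the real metrics service:

```python
import time
import tracemalloc
from collections import Counter

from recipe_scrapers import scrape_html  # parser pinned at time C (assumption)

# In-process stand-in for the real metrics backend (assumption).
metrics: Counter = Counter()

def parse_host(host: str, pages: list[tuple[str, str]], T: str, C: str) -> None:
    """Process (url, html) pairs sampled at time T -- at most 100 per host."""
    for url, html in pages:
        metrics[f"content.{T}.{host}.retrieved"] += 1
        metrics[f"content.{T}.{host}.size"] += len(html)

        tracemalloc.start()
        started = time.perf_counter()
        try:
            scraper = scrape_html(html, org_url=url)
            # Illustrative subset of the recipe-related fields to extract.
            fields = {
                "title": scraper.title(),
                "ingredients": scraper.ingredients(),
                "instructions": scraper.instructions(),
            }
        except Exception:
            metrics[f"parse.{T}.{host}.error"] += 1
            metrics[f"parse.{T}.{C}.{host}.error"] += 1
        else:
            # Any empty/none field downgrades the result to "partial".
            outcome = "success" if all(fields.values()) else "partial"
            metrics[f"parse.{T}.{host}.{outcome}"] += 1
            metrics[f"parse.{T}.{C}.{host}.{outcome}"] += 1
        finally:
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            duration = time.perf_counter() - started
            metrics[f"parse.{T}.{host}.duration"] += duration
            metrics[f"parse.{T}.{C}.{host}.duration"] += duration
            metrics[f"parse.{T}.{host}.memory"] += peak
            metrics[f"parse.{T}.{C}.{host}.memory"] += peak
```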
This will require at least two additional storage-related services; sketches of how each might be used follow the list below.
webarchive - a service that will collect and retrieve archived webpage material. To some extent this will overlap/duplicate our existing use of squid as a persistent cache. I think we should try pywb.
crawlstats - a service to record and query crawling-related statistics. Currently leaning towards suggesting carbon for storage, with graphyte as the Python client.
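As a rough sketch of how tardir might look up archived material in the webarchive service, assuming a local pywb instance (e.g. `wayback` on port 8080) serving a collection named "tardir" with its CDX server endpoint enabled -- the port, collection name, and route are assumptions, not settled configuration:

```python
import requests

# Query pywb's CDX server API for captures of a host (assumed setup).
response = requests.get(
    "http://localhost:8080/tardir/cdx",
    params={"url": "example.com/*", "output": "json", "limit": 10},
    timeout=10,
)
response.raise_for_status()
for line in response.text.splitlines():
    print(line)  # one capture record per line
```

And a sketch of forwarding the workflow's counters to crawlstats via graphyte, assuming a carbon daemon listening on the default plaintext port; the prefix and metric names/values are illustrative:

```python
import graphyte

# Assumes a carbon daemon on localhost:2003 (the default plaintext port).
graphyte.init("localhost", prefix="tardir")

# Forwarding counters assembled by the workflow sketch above; metric names
# and values here are illustrative (dots in hostnames would need
# substitution, e.g. example.com -> example_com).
graphyte.send("content.20240101.example_com.retrieved", 97)
graphyte.send("parse.20240101.20240101.example_com.error", 3)
```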