
Implement a system for retrieval and assessment of recipe metadata from the World Wide Web #1

Open · jayaddison opened this issue Oct 27, 2022 · 1 comment
Labels: enhancement (New feature or request)

Background

RecipeRadar is a recipe search engine that provides ingredient-based search across recipes sourced from around the World Wide Web.

Currently the system uses the recipe-scrapers Python library to parse recipes from webpage HTML and to produce a set of relevant metadata fields, which are then processed and used to build a search engine index.
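For context, a minimal sketch of that parsing step, assuming the library's scrape_me entry point and a handful of its field accessors (the URL and the selected fields are illustrative, and available accessors vary by site):

```python
# Sketch of the existing parse step, assuming the recipe-scrapers public API;
# the URL below is hypothetical and the accessor set varies per scraper.
from recipe_scrapers import scrape_me

scraper = scrape_me("https://www.example-recipes.test/chocolate-cake")

metadata = {
    "host": scraper.host(),
    "title": scraper.title(),
    "total_time": scraper.total_time(),   # minutes
    "yields": scraper.yields(),
    "ingredients": scraper.ingredients(),
    "instructions": scraper.instructions(),
}
```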

We (attempt to) contribute fixes and improvements to recipe-scrapers because we stand to benefit from improved coverage, accuracy and performance when the library is correct, well-tested and optimized.

Challenges

Continuous testing against live and/or near-live data is currently outside of the scope of the recipe-scrapers library. We would like to provide ongoing healthcheck-style testing against recently-sourced webpages to identify opportunities for improvement.

TARDIR Goals

The intended consumer of tardir data is a dashboard providing a colour-coded Gantt chart covering each supported website.

The passage of time should be represented horizontally, and each website host should be represented by a row-oriented bar (or sparkline) that allows rapid visual identification of changes over time.

A precise algorithm to determine the display colour from a point-in-time onwards is TBD; it might, for example, represent 100% success as turquoise and 100% scraper error as red-orange (ref). Site outages/unavailability should be represented differently -- perhaps by grey, to indicate that support has been implemented but that data is unavailable.
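As one possible starting point (not a decision), the colour could be derived by blending between the success and error hues, falling back to grey when no data was retrievable; the hex values and the linear blend below are illustrative assumptions:

```python
# Illustrative sketch only: maps per-host, per-interval counts to a display
# colour. Hue choices and the linear blend are assumptions, not the final algorithm.
SUCCESS = (64, 224, 208)   # turquoise
ERROR = (255, 83, 73)      # red-orange
UNAVAILABLE = "#9e9e9e"    # grey: site outage / no data retrieved


def interval_colour(success: int, error: int, unavailable: int) -> str:
    parsed = success + error
    if parsed == 0:
        return UNAVAILABLE
    ratio = success / parsed  # 1.0 -> turquoise, 0.0 -> red-orange
    r, g, b = (round(e + (s - e) * ratio) for s, e in zip(SUCCESS, ERROR))
    return f"#{r:02x}{g:02x}{b:02x}"
```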

See grafana-gantt-panel to get an idea of the expected output format.

tardir is responsible for assembling the datapoints to make this representation possible.

The representation is intended to guide our team towards improved coverage and accuracy for our users, and reduced resource costs (time, memory, and energy) to provide the infrastructure.

Rows in the chart should be sorted by health score, with the least healthy rows presented at the top (first-in-line for maintenance attention).
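A minimal illustration of that ordering (the health field and its values are assumptions here; the real score definition is still TBD):

```python
# Hypothetical per-host summaries; "health" in [0.0, 1.0] is an assumed score.
hosts = [
    {"host": "example-recipes.test", "health": 0.42},
    {"host": "another-site.test", "health": 0.97},
]

# Least healthy first, so those rows appear at the top of the dashboard.
rows = sorted(hosts, key=lambda row: row["health"])
```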

⚠️ In some cases, fixes may not be possible within the recipe-scrapers library, and upstream modifications will be required. It is difficult to ensure that problems are solved (and solved adequately) while also preventing a small subset of error-prone sites from monopolizing maintainer time.

Alignment of incentives -- perhaps with multiple alignment paths depending on site culture/psychology -- may help with this.

Resources

The CommonCrawl project provides a regularly-updated collection of pages collected from the World Wide Web, organized according to various indexing schemes.
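As an illustration, captures for a given host can be located through the CommonCrawl CDX index API and the matching WARC records fetched with HTTP range requests; the crawl label, host pattern, and sample limit below are placeholder assumptions:

```python
# Sketch of sampling pages for one host from a CommonCrawl crawl, assuming the
# public CDX index API and WARC range-request access; values are placeholders.
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL = "CC-MAIN-2022-40"  # example crawl label
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL}-index"


def sample_pages(host: str, limit: int = 100):
    """Yield (url, html_bytes) pairs for up to `limit` captures of one host."""
    index = requests.get(
        INDEX_URL,
        params={"url": f"{host}/*", "output": "json", "limit": str(limit)},
        timeout=30,
    )
    index.raise_for_status()
    for line in index.text.splitlines():
        capture = json.loads(line)
        start = int(capture["offset"])
        end = start + int(capture["length"]) - 1
        # Fetch only this capture's bytes from the WARC file that contains it.
        warc = requests.get(
            f"https://data.commoncrawl.org/{capture['filename']}",
            headers={"Range": f"bytes={start}-{end}"},
            stream=True,
            timeout=60,
        )
        warc.raise_for_status()
        for record in ArchiveIterator(warc.raw):
            if record.rec_type == "response":
                yield capture["url"], record.content_stream().read()
```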

Workflow

The tardir workflow accepts two parameters as input:

  1. A time T, used as an upper-bound on the content (CommonCrawl) to retrieve
  2. A time C, used as an upper-bound on the parser (recipe-scrapers) version to use
  • Retrieve -- and where not available, collect the most-recent-known version of -- a sample of at most 100 pages per host at time T, served by named hosts supported by recipe-scrapers at time C

    • HTTP redirect codes to the same host should be followed
    • HTTP redirect codes to different hosts should be counted (content.T.host.migrated)
    • HTTP errors should add a failure count to the host (content.T.host.failed)
    • Successful HTTP retrievals should be counted (content.T.host.retrieved)
    • Content size should be recorded (content.T.host.size)
  • The recipe-scrapers library at time C should be used to extract all recipe-related fields from the at most 100 pages sampled per host during the retrieval stage (see the sketch after this list)

    • Errors/exceptions should be counted (parse.T.host.error and parse.T.C.host.error)
    • Empty/none values should be counted (parse.T.host.partial and parse.T.C.host.partial)
    • Complete success should be counted (parse.T.host.success and parse.T.C.host.success)
    • Time spent parsing the recipes should be recorded (parse.T.host.duration and parse.T.C.host.duration)
    • Memory used parsing the recipes should be recorded (parse.T.host.memory and parse.T.C.host.memory)
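A condensed sketch of the parse stage and its measurements, consuming the (url, html) pairs from the retrieval stage. It assumes the library's scrape_html entry point for pre-fetched HTML, time.perf_counter for duration, and tracemalloc for memory; the record_metric helper, the field list, and the metric-name formatting are illustrative assumptions rather than a fixed design:

```python
# Illustrative sketch of the per-host parse stage; record_metric, FIELDS, and
# the metric-name formatting are assumptions for this example.
import time
import tracemalloc

from recipe_scrapers import scrape_html

FIELDS = ("title", "total_time", "yields", "ingredients", "instructions")


def record_metric(name: str, value: float) -> None:
    """Placeholder sink; in practice this would feed the crawl statistics store."""
    print(name, value)


def parse_host_sample(host: str, pages, t: str, c: str) -> None:
    """Parse the sampled (url, html) pairs for one host and emit counts."""
    counts = {"error": 0, "partial": 0, "success": 0}
    tracemalloc.start()
    started = time.perf_counter()

    for url, html in pages:
        try:
            scraper = scrape_html(html, org_url=url)
            values = [getattr(scraper, field)() for field in FIELDS]
        except Exception:
            counts["error"] += 1
            continue
        if any(value in (None, "", []) for value in values):
            counts["partial"] += 1
        else:
            counts["success"] += 1

    duration = time.perf_counter() - started
    _, peak_memory = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    for outcome, count in counts.items():
        record_metric(f"parse.{t}.{c}.{host}.{outcome}", count)
    record_metric(f"parse.{t}.{c}.{host}.duration", duration)
    record_metric(f"parse.{t}.{c}.{host}.memory", peak_memory)
```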
jayaddison (Member, Author) commented:

This will require at least two additional storage-related services.

  • webarchive - a service that will collect and retrieve archived webpage material. To some extent this will overlap with/duplicate our existing use of Squid as a persistent cache. I think we should try pywb.
  • crawlstats - a service to record and query crawling-related statistics. Currently leaning towards suggesting Carbon for storage and using graphyte as the Python client (a brief sketch follows below).
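For reference, a minimal sketch of emitting one of the workflow's counters through graphyte; the Carbon/Graphite hostname, prefix, and metric value are placeholders:

```python
# Minimal sketch of reporting a crawl statistic via graphyte; the Graphite
# host, prefix, and metric value below are placeholders.
import graphyte

# Hypothetical crawlstats/Carbon endpoint.
graphyte.init("crawlstats.internal.example", prefix="tardir")

# e.g. 87 successful parses for one sampled host in this run.
graphyte.send("parse.T.C.example-recipes.test.success", 87)
```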
