
Implement headless browser based scraping #169

Open
wants to merge 5 commits into main
Conversation

@ajxu2 (Contributor) commented Feb 16, 2025

Notes:

  • Style is probably not good; someone more experienced should look over this.
  • Tests are almost completely broken because the constructors of DiningParser and LocationBuilder were changed.
  • Tests also do not integrate well with Puppeteer because of `SyntaxError: Cannot use import statement outside a module`.
  • Currently we wait 10 seconds before fetching the contents of each page, to make sure all of its JavaScript has run (see the sketch below). This means the Dining API will take about 6 minutes to refresh. More importantly, immediately after deployment there will be a 6-minute window where the Dining API is unusable. Is this a concern?
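For context, a minimal sketch of what that headless fetch with a hardcoded wait could look like, assuming Puppeteer; the function name is illustrative, not the actual DiningParser wiring in this PR:

```ts
import puppeteer from "puppeteer";

// Hypothetical helper: load a dining page in a headless browser and wait a fixed
// 10 seconds so client-side JavaScript can finish rendering before grabbing the HTML.
async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    await new Promise((resolve) => setTimeout(resolve, 10_000)); // the 10-second wait mentioned above
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

If pages are fetched one at a time, a 10-second wait per page adds up quickly, which is presumably where the roughly 6-minute refresh comes from.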


vercel bot commented Feb 16, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| dining-api | ❌ Failed (Inspect) | Feb 16, 2025 11:55pm |

@GhostOf0days (Member)

Can take it step by step. First, need to resolve merge conflicts. Also, 6-minute downtime for the Dining API is not a concern. Can merge at like 4 AM or something.

@cirex-web (Collaborator)

> Also, 6-minute downtime for the Dining API is not a concern. Can merge at like 4 AM or something.

I set up a healthcheck on a Railway sandbox so that the internal port is switched over only when the /locations endpoint is ready. We'll now supposedly have zero minutes of downtime! :)

[Screenshot: API reloading but still publicly accessible]
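A rough sketch of the readiness gate this relies on, assuming an Express-style server (endpoint and variable names are illustrative): the deploy healthcheck points at /locations, and the route returns a non-200 status until the first scrape has populated the cache, so the platform keeps routing public traffic to the old instance in the meantime.

```ts
import express from "express";

const app = express();
let cachedLocations: unknown[] | null = null; // filled in by the first scrape

// The deploy healthcheck hits this path; until it returns 200,
// the new deployment never receives public traffic.
app.get("/locations", (_req, res) => {
  if (cachedLocations === null) {
    res.status(503).send("still scraping");
    return;
  }
  res.json({ locations: cachedLocations });
});

async function refreshLocations(): Promise<void> {
  // ...run the headless-browser scrape here and populate the cache...
  cachedLocations = [];
}

void refreshLocations();
app.listen(Number(process.env.PORT ?? 3000));
```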

@cirex-web (Collaborator) commented Feb 17, 2025

So, here's the situation:

Screenshot by puppeteer after waiting for domcontentloaded and networkidle2:
[screenshot]

Screenshot by puppeteer after waiting another 20s:
[screenshot]

Correct data:
[screenshot]

Unclear how we can fix this... is there some limitation with headless browsing that isn't apparent in a real browser? (I was able to replicate the persistent incorrect data in Chrome as well...)

I wonder if it would be wise to take the majority result out of n queries to the same URL and call it a day.

The probability of failure is ~0.07, so the probability of having at most 4 failures in 10 fetches (i.e., a majority of correct results) is about 0.99969, which is pretty good. (BUT empirically, it seems like the failures are grouped together and are thus not independent.)
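For reference, that figure is just the binomial tail with n = 10 independent fetches and failure probability p = 0.07:

$$\Pr(\text{at most 4 failures}) = \sum_{k=0}^{4} \binom{10}{k} (0.07)^k (0.93)^{10-k} \approx 0.99969$$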

  • Failure rate with puppeteer (no hardcoded await): ~0.056 (n = 100+)
  • Failure rate with fetch + 1 s delay between fetches: ~0.023 (n = 50?)
  • Failure rate with fetch (no delay in between): ~0.028 (n = pretty large)

Fluctuation may be due to noise/time of experimentation. More rigorous testing is needed to determine if puppeteer is more or less reliable than fetching.
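For what it's worth, the majority-of-n idea above could look roughly like the hypothetical helper below; it is not part of this PR, and in practice you would probably vote on parsed menu data rather than raw response bodies, since those can differ in irrelevant ways (timestamps, nonces, etc.).

```ts
// Fetch the same URL n times and return the most common response body.
async function fetchWithMajority(url: string, n = 10): Promise<string> {
  const bodies = await Promise.all(
    Array.from({ length: n }, () => fetch(url).then((res) => res.text()))
  );

  // Tally identical bodies and keep the one that occurs most often.
  const counts = new Map<string, number>();
  let best = bodies[0];
  let bestCount = 0;
  for (const body of bodies) {
    const count = (counts.get(body) ?? 0) + 1;
    counts.set(body, count);
    if (count > bestCount) {
      best = body;
      bestCount = count;
    }
  }
  return best;
}
```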
