
Implement headless browser based scraping #169

Open
wants to merge 5 commits into main
Conversation

@ajxu2 (Contributor) commented Feb 16, 2025

Notes:

  • Style is probably not good; someone more experienced should look over this.
  • Tests are almost completely broken because the constructors of DiningParser and LocationBuilder were changed.
  • Tests also do not integrate well with Puppeteer because of `SyntaxError: Cannot use import statement outside a module`.
  • Currently we wait 10 seconds before fetching the contents of each page, to make sure all of its JavaScript has run (see the sketch below). This means the Dining API will take about 6 minutes to refresh. More importantly, immediately after deployment there will be a 6-minute window where the Dining API is unusable. Is this a concern?
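For context, a minimal sketch of what that headless fetch with a hardcoded wait could look like, assuming Puppeteer; the function name is illustrative, not the actual DiningParser wiring in this PR:

```ts
import puppeteer from "puppeteer";

// Hypothetical helper: load a dining page in a headless browser and wait a fixed
// 10 seconds so client-side JavaScript can finish rendering before grabbing the HTML.
async function fetchRenderedHtml(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    await new Promise((resolve) => setTimeout(resolve, 10_000)); // the 10-second wait mentioned above
    return await page.content();
  } finally {
    await browser.close();
  }
}
```

If pages are fetched one at a time, a 10-second wait per page adds up quickly, which is presumably where the roughly 6-minute refresh comes from.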


vercel bot commented Feb 16, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Updated (UTC) |
| --- | --- | --- |
| dining-api | ❌ Failed (Inspect) | Feb 16, 2025 11:55pm |

@GhostOf0days (Member)

Can take it step by step. First, need to resolve merge conflicts. Also, 6-minute downtime for the Dining API is not a concern. Can merge at like 4 AM or something.

@cirex-web (Collaborator)

> Also, 6-minute downtime for the Dining API is not a concern. Can merge at like 4 AM or something.

I set up a healthcheck on a Railway sandbox so that the internal port is switched over only when the /locations endpoint is ready. We'll now supposedly have zero minutes of downtime! :)

[Screenshot: API reloading but still publicly accessible]
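A rough sketch of the readiness gate this relies on, assuming an Express-style server (endpoint and variable names are illustrative): the deploy healthcheck points at /locations, and the route returns a non-200 status until the first scrape has populated the cache, so the platform keeps routing public traffic to the old instance in the meantime.

```ts
import express from "express";

const app = express();
let cachedLocations: unknown[] | null = null; // filled in by the first scrape

// The deploy healthcheck hits this path; until it returns 200,
// the new deployment never receives public traffic.
app.get("/locations", (_req, res) => {
  if (cachedLocations === null) {
    res.status(503).send("still scraping");
    return;
  }
  res.json({ locations: cachedLocations });
});

async function refreshLocations(): Promise<void> {
  // ...run the headless-browser scrape here and populate the cache...
  cachedLocations = [];
}

void refreshLocations();
app.listen(Number(process.env.PORT ?? 3000));
```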

@cirex-web (Collaborator) commented Feb 17, 2025

So, here's the situation:

Screenshot by puppeteer after waiting for domcontentloaded and networkidle2:
[screenshot]

Screenshot by puppeteer after waiting another 20s:
[screenshot]

Correct data:
[screenshot]

Unclear how we can fix this... is there some limitation with headless browsing that isn't apparent in a real browser? (I was able to replicate the persistent incorrect data in Chrome as well...)

I wonder if it would be wise to take the majority result out of n queries to the same URL and call it a day.

The probability of failure is ~0.07, so the probability of having at most 4 failures in 10 fetches (i.e., a majority of correct results) is about 0.99969, which is pretty good. (BUT empirically, it seems like the failures are grouped together and are thus not independent.)
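For reference, that figure is just the binomial tail with n = 10 independent fetches and failure probability p = 0.07:

$$\Pr(\text{at most 4 failures}) = \sum_{k=0}^{4} \binom{10}{k} (0.07)^k (0.93)^{10-k} \approx 0.99969$$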

  • Failure rate with puppeteer (no hardcoded await): ~0.056 (n = 100+)
  • Failure rate with fetch + 1 s delay between fetches: ~0.023 (n = 50?)
  • Failure rate with fetch (no delay in between): ~0.028 (n = pretty large)

Fluctuation may be due to noise/time of experimentation. More rigorous testing is needed to determine if puppeteer is more or less reliable than fetching.
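For what it's worth, the majority-of-n idea above could look roughly like the hypothetical helper below; it is not part of this PR, and in practice you would probably vote on parsed menu data rather than raw response bodies, since those can differ in irrelevant ways (timestamps, nonces, etc.).

```ts
// Fetch the same URL n times and return the most common response body.
async function fetchWithMajority(url: string, n = 10): Promise<string> {
  const bodies = await Promise.all(
    Array.from({ length: n }, () => fetch(url).then((res) => res.text()))
  );

  // Tally identical bodies and keep the one that occurs most often.
  const counts = new Map<string, number>();
  let best = bodies[0];
  let bestCount = 0;
  for (const body of bodies) {
    const count = (counts.get(body) ?? 0) + 1;
    counts.set(body, count);
    if (count > bestCount) {
      best = body;
      bestCount = count;
    }
  }
  return best;
}
```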
