Re-filling the feb-may 2022 "dip" using canonical URL extraction #353

Open
philbudne opened this issue Nov 24, 2024 · 10 comments
@philbudne
Contributor

Given the successful recovery by "blind fetching" S3 objects and using the extracted canonical URL, I suggested we might want to do the same thing for the period in 2022 (approx 2022-01-25 thru 2022-05-05?) where Xavier fetched all the S3 objects but looked only at the RSS files to extract URLs, which we then tried to fetch again (and found significant "link rot").

It looks like the researchers would prefer filling out 2022 to working on prior years (2019 and earlier).

To see whether it might be a savings (assuming my memory is correct that access to S3 is free from EC2 instances) to scan the S3 objects from an EC2 instance and pack them (without ANY further processing, not even checking whether the file has a canonical URL) into WARC files, I threw together a packer.py script with bits cribbed from hist-fetcher and the parser (for RSS detection).

To get 100 HTML files it scanned 767 S3 objects (some non-existent, some 36 bytes or smaller, which almost certainly contain a message saying the download was a duplicate feed), downloaded a total of 5676225 bytes (avg 7400 bytes/obj scanned), and wrote a WARC file that's 2726394 bytes (48% of the downloaded size), so it might be worthwhile (with more consideration and math).
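
For reference, here's a minimal sketch of the kind of packer loop described above (boto3 + warcio); the bucket name, key layout, and size threshold are placeholders, not what packer.py actually does:

```python
import io

import boto3
from warcio.warcwriter import WARCWriter

BUCKET = "legacy-downloads-bucket"   # hypothetical bucket name, NOT the real one
MIN_HTML_SIZE = 37                   # skip the <= 36 byte "duplicate feed" stubs

def pack_range(first_id: int, last_id: int, warc_path: str) -> None:
    """Scan a range of S3 object IDs and pack raw bodies into one gzipped WARC."""
    s3 = boto3.client("s3")
    with open(warc_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for obj_id in range(first_id, last_id + 1):
            key = str(obj_id)        # hypothetical key layout
            try:
                body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            except s3.exceptions.NoSuchKey:
                continue             # some object IDs simply don't exist
            if len(body) < MIN_HTML_SIZE:
                continue             # almost certainly a duplicate-feed marker
            record = writer.create_warc_record(
                f"s3://{BUCKET}/{key}", "resource", payload=io.BytesIO(body))
            writer.write_record(record)
```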

Then I created a t4g.nano instance ($0.0042/hr) to see how much faster downloads are from inside AWS, and it took about half the time (23 seconds vs. 48 seconds from ifill). That doesn't include the additional time for the EC2 instance to copy the WARC file to S3.

Further data points:

My initial estimate (working only from the previous run of a one month period, and factoring in halving the speed to avoid hogging UMass bandwidth) was that it could take 56 days.

Poking around in the S3 bucket, it looks like the object ID range covers about 113 million objects to be scanned; at 50 downloads/second (the current historical ingest rate with 6 fetchers), that looks like it could take only 26 days.

So six "packers" (each given a share of the object ID range) running in EC2 at 33 obj/second each is about 200 obj/second total, and 113M objects divided by 200 obj/sec looks to be about a week of EC2 time.
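
Spelling out that rate arithmetic (figures from above):

```python
objects = 113_000_000              # approximate object-ID range to scan
per_host = 50                      # obj/sec: current historical ingest rate
packers = 6 * 33                   # six packers at ~33 obj/sec each, ~200 obj/sec

print(objects / per_host / 86400)  # ~26 days at 50 obj/sec
print(objects / packers / 86400)   # ~6.6 days, i.e. about a week
```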

A t3a.xlarge instance (4 AMD CPUs) is $0.15/hr, which would be $25 for a week (not counting EBS costs for the root disk).

Amazon pricing usually doubles for a doubling in resources, so the total price might be the same for different instance sizes; the instance size just determines the speed (assuming there isn't some other bottleneck).

With the 7400 bytes/obj number from above, at 113M objects that's 836GB of download to transfer the raw objects.
The WARC files came in at 3555 bytes/object, or 402GB to download.

Processing the packed WARC files should be much like any other historical ingest (although it will require a different stack flavor), and I'd expect we would be able to process at the same rate (a month every 4 days at 50 stories/second),
so about 12 days. The arch-queuer shouldn't need any changes, and it can scan an S3 bucket for new additions, so the pipeline could run at the same time as the EC2 processing.

It looks like we transferred 2TB/mo out of AWS in September and October, which puts us in the $0.09/GB bracket, so saving 434GB of egress would be worth $39; with at LEAST $25 of EC2 cost, that means a net savings of at most $14.
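
The transfer-size and cost arithmetic above, spelled out:

```python
objects  = 113_000_000
raw_gb   = objects * 7400 / 1e9    # ~836 GB to pull the raw objects out of AWS
warc_gb  = objects * 3555 / 1e9    # ~402 GB if we pull packed WARCs instead
saved_gb = raw_gb - warc_gb        # ~434 GB less egress

egress_saving = saved_gb * 0.09    # ~$39 at the $0.09/GB bracket
ec2_cost = 25                      # at LEAST, for ~a week of t3a.xlarge
print(egress_saving - ec2_cost)    # ~$14 net savings, at most
```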

Running ad-hoc packers (as opposed to a RabbitMQ-based pipeline/stack) has the disadvantage that if the packer processes quit, they wouldn't be able to pick up where they left off without some record keeping. To get the RSS filtering capability we'd need a worker that does just that, or a parser option that says to do ONLY that!
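
One minimal way to add that record keeping would be a per-packer checkpoint file recording the last object ID successfully written; a sketch (hypothetical helper names, not part of packer.py):

```python
from pathlib import Path

def next_object_id(state_file: Path, range_start: int) -> int:
    """Where to resume: one past the last object ID recorded, else the range start."""
    if state_file.exists():
        return int(state_file.read_text()) + 1
    return range_start

def record_progress(state_file: Path, obj_id: int) -> None:
    """Periodically record the last object ID successfully packed."""
    state_file.write_text(str(obj_id))
```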

One thing I haven't examined is how many duplicate stories we might end up with (where the canonical URL differs from the final URL we got when downloading via the RSS file URL); I also haven't looked at whether we could delete the stories previously fetched using Xavier's CSV files. One way would be to look at the WARC files written when the CSVs were processed, but there might be other ways (looking at indexed_date and published_date?)

I used an ARM64 instance, initially with IPv6 only, running Ubuntu 24.04, and had some "fun":

  • GitHub does not speak IPv6!
  • cchardet didn't want to build on Python 3.12
  • probably some other things I'm forgetting
@philbudne
Contributor Author

Copied from a slack message thread:

To see how many objects fetched "blind" (by object ID) from Feb-April yield canonical URLs that are rejected as duplicate, I ran a batch of 15206 object IDs in the production hist-indexer stack. Looking at the log file: 13182 fetched from S3, were HTML, and had a canonical URL; 8443 were rejected as dups (they had likely been fetched when we fetched via the CSV files Xavier had created by scooping RSS files from the same bucket), and 4723 were "new". Anyone have any thoughts on how to detect/quantify if we'd be getting dups?

Then I did a run of 50K S3 object IDs:

Diffing before and after counts:

 pbudne@tarbell:~/query$ diff -y 2022.[14] | grep '|' | sed 's/                                  //'
2022-01-20 607548      |	2022-01-20 607721
2022-01-21 580131      |	2022-01-21 580322
2022-01-22 389058      |	2022-01-22 389107
2022-01-23 378421      |	2022-01-23 378530
2022-01-24 576418      |	2022-01-24 576932
2022-01-25 426326      |	2022-01-25 433157
2022-01-26 242361      |	2022-01-26 243268
2022-02-08 313719      |	2022-02-08 313721
2022-04-01 392906      |	2022-04-01 392907
2022-05-01 168855      |	2022-05-01 168856
2022-05-04 255443      |	2022-05-04 255444
total 34921806	      |	total 34930585

And looking at the logs, 16939 created, 34370 rejected as duplicate, so 33% non-duplicate. ISTR the estimate was that the "dip" was about 40% of expected levels, so if this adds 50% to that, we'd still be below expected levels, which adds some comfort that it isn't massively duplicative. I'm going to restart 2020 on bernstein (about a week of processing left) to give time for thought & comment.

@philbudne
Contributor Author

I took the WARC files generated above (the 16939 stories NOT rejected as duplicate URLs) and, for each one, searched Elastic for stories with the same canonical_domain and article_title and indexed_date:[2024-05-01 TO 2024-06-30] (the CSV-based 2022 back-fill).

Attached is the program:
read.py.txt

and the output
2022.txt
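
The gist of that per-story search is a bool query on canonical_domain, article_title and indexed_date; a sketch follows, where the endpoint, index pattern, and client call style are assumptions rather than what read.py actually does:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")      # placeholder endpoint

def possible_dups(domain: str, title: str) -> list[dict]:
    """Find already-indexed stories with the same canonical_domain and
    article_title, limited to the CSV back-fill's indexed_date window."""
    query = {
        "bool": {
            "must": [
                {"term": {"canonical_domain": domain}},
                {"match_phrase": {"article_title": title}},
            ],
            "filter": [
                {"range": {"indexed_date": {"gte": "2024-05-01", "lte": "2024-06-30"}}},
            ],
        }
    }
    resp = es.search(index="mc_search-*", query=query, size=10)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```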

@philbudne
Contributor Author

@kilemensi observed that the first two new urls (canonical URLs extracted from S3 documents) resulted in redirects to an "old" URL (URL of a previously fetched story with the same title and canonical domain). Following on this, I ran the above output thru a program that took the canonical URL, tried to open it, took the final URL and processed it with normalize_url and then compared the result with normalize_url on each of the old articles' URLs.

Of the 2903 articles in 2022.txt, 1056 of them matched using the above test.
Removing trailing / from both normalized URLs, the number of matches went up to 1413.

An example of a Story that didn't come up as a match until removing a terminal slash:

canonical https://www.tvc.ru/news/show/id/231331
canonical normalized http://tvc.ru/news/show/id/231331
canonical final https://www.tvc.ru/news/show/id/231331
canonical final normalized http://tvc.ru/news/show/id/231331
old https://www.tvc.ru/news/show/id/231331/?utm_source=news.yandex.ru&utm_content=RSS&utm_campaign=yandex
old normalized http://tvc.ru/news/show/id/231331/

So 8% of the articles inserted using canonical URL (NOT immediately rejected as duplicate) look like they're dups, which isn't lovely (I have all the URLs from the test, and can remove them if need be).
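
For reference, the matching test described above amounts to roughly this sketch (the normalize_url import path and the trailing-slash handling are assumptions):

```python
import requests
from mcmetadata.urls import normalize_url   # assumed import path for the normalizer

def looks_like_dup(canonical_url: str, old_urls: list[str]) -> bool:
    """Follow redirects from the canonical URL, then compare normalized URLs
    (ignoring a trailing slash) against the previously fetched 'old' URLs."""
    try:
        final_url = requests.get(canonical_url, timeout=30, allow_redirects=True).url
    except requests.RequestException:
        final_url = canonical_url
    targets = {normalize_url(canonical_url).rstrip("/"),
               normalize_url(final_url).rstrip("/")}
    return any(normalize_url(u).rstrip("/") in targets for u in old_urls)
```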

@philbudne
Contributor Author

Update: @pgulley @m453h

Fetching downloads_ids 3361000000 thru 3474071196 and processing for canonical_url has been completed (writing directly to Backblaze WITHOUT import into Elasticsearch). The results are in the Backblaze (B2) mediacloud-indexer-archive bucket, with warcfile prefix "hist2022", dated 2024/12/14 thru 2024/12/31.

Plausible sub-tasks (maybe create separate sub-tasks here in github?):

  1. extraction of URLs to be deleted from ES index [1][2]
  2. estimation of how long it will take to execute deletion
  3. estimation of how long to ingest (using docker/deploy.sh -T archive @file-with-list-of-new-warcs) [2]
  4. determine when/where to perform surgery (in old cluster, before transfer to new cluster? in new cluster?? etc)
  5. perform surgery (deletion and importation)

[1] https://github.com/philbudne/story-indexer/blob/arch-eraser-sketch/indexer/scripts/arch-eraser.py is a sketch of code to read WARC files for the OLD backfill, and extract URLs for deletion. It does not rely on RabbitMQ, and is meant to be runnable outside Docker. The doc string at the top discusses the warc files to be processed (and where they are located). Some of the archives are only stored in Amazon S3 (where we will pay data transit costs to retrieve them). It used to be that S3 retrieval inside of Amazon EC2 was free (no Internet transit cost). If the amount of data to retrieve is large, it might be worth considering running the above script to just extract the URLs in an EC2 instance (would likely be faster, since the round-trip time to retrieve the WARC files could well dominate the runtime).
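
For reference, the core of that URL-extraction pass over the old backfill WARCs is roughly the following sketch (warcio-based; not the arch-eraser code itself):

```python
from warcio.archiveiterator import ArchiveIterator

def urls_to_delete(warc_paths: list[str]) -> set[str]:
    """Collect the target URL of every response record in the given WARC files."""
    urls: set[str] = set()
    for path in warc_paths:
        with open(path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == "response":
                    urls.add(record.rec_headers.get_header("WARC-Target-URI"))
    return urls
```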

[2] https://github.com/philbudne/story-indexer/blob/arch-eraser-sketch/indexer/queuer.py#L215 adds code to take an "indirect" file from the command line with a list of input files (which must be full paths/URLs -- not relative to the indirect file location).
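
The indirect-file idea is simply: any command-line argument of the form @listfile is replaced by the lines of that file. A sketch of that expansion (hypothetical helper, not the actual queuer.py code):

```python
def expand_indirect_args(args: list[str]) -> list[str]:
    """Replace any '@listfile' argument with the non-blank lines of that file
    (each line a full path or URL, NOT relative to the list file's location)."""
    expanded: list[str] = []
    for arg in args:
        if arg.startswith("@"):
            with open(arg[1:]) as f:
                expanded.extend(line.strip() for line in f if line.strip())
        else:
            expanded.append(arg)
    return expanded
```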

@philbudne
Contributor Author

@pgulley @m453h the affected date range from #271 (comment) is 2022/01/25 thru 2022/05/08

@philbudne
Contributor Author

Here is the original issue for the 2022 backfill, describing the sources:
#271

And my notes on the contents of the ES indices and WARC files:

================ index mc_search-00002 created 2024-03-24T01:07:19.076Z

All files on S3, some on B2 (starting 2024/05/31)

arch prefix	start		end		archives
mccsv(2022)	2024/05/22 -> 2024/06/22	S3/(B2)
mc[rss](2022)	2024/05/27 -> 2024/06/22	S3/(B2) [1]

[1] Files named mcrss- start 2024/06/20; prior files are named mc- and can be grouped/identified by the host/container name that wrote the WARC files.  The following need (re)verification: fetch one WARC for each container name and verify (I use "zmore" to view the uncompressed contents and skip forward with the "/" command to the first ": metadata") that the "via" field in the "rss_entry" section indicates an RSS file or a CSV file in the S3 mediacloud-database-e-files bucket. (A sketch of automating this spot check follows the table below.)

dates			container name (in WARC filename)
2024/05/27-2024/05/28	cf94b52abe5a
2024/05/29-2024/06/04	cefd3fdce464
2024/06/05		0c501ed61cf4
2024/06/05		446d55936e82
2024/06/05		cefd3fdce464
2024/06/06-2024/06/09	0c501ed61cf4
2024/06/09		7e1b47c305f1
2024/06/11-2024/06/20	6c55aaf9daaa
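
As noted above, here is a sketch of automating that spot check, assuming the metadata record payload is JSON with an rss_entry object containing a via field (as the zmore output suggests):

```python
import json

from warcio.archiveiterator import ArchiveIterator

def first_via(warc_path: str) -> str | None:
    """Return the rss_entry 'via' value from the first metadata record found."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "metadata":
                meta = json.loads(record.content_stream().read())
                via = meta.get("rss_entry", {}).get("via")
                if via:
                    return via   # e.g. s3://mediacloud-public/daily-rss/mc-2022-04-25.rss
    return None
```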

================ mc_search-000003 created 2024-06-22T01:09:36.337Z

arch prefix	start		end		archives
mccsv(2022)	2024/06/22 -> 2024/06/27	S3/(B2)
mcrss(2022)	2024/06/22 -> 2024/08/16	S3/(B2)
mchist2022	2024/06/22 -> 2024/08/06	S3/(B2)

(B2) means some of the date range on B2
(S3) means some of the date range on S3

NOTE! I've omitted all mchist2022 archives: the name means they were generated from CSV files that contained BOTH the original URL and the S3 object ID, and they fall outside the 2022-01-25 thru 2022-05-08 date range

pgulley assigned pgulley, philbudne and m453h and unassigned pgulley on Jan 30, 2025
@m453h
Contributor

m453h commented Feb 1, 2025

(Quoting @philbudne's comment above: the #271 reference and the notes on the ES indices and WARC files.)

Thanks @philbudne, these notes have been very helpful in getting the WARC files. I have generated lists of the files for each date range, which can be examined in the table(s) below:

index mc_search-00002 created 2024-03-24T01:07:19.076Z

arch prefix	Start Date	End Date	B2 List	S3 List
mccsv(2022)[1]	2024/05/22	2024/06/22	Download	Download
mcrss(2022)	2024/06/20	2024/06/22	Download	Download

[1] Started on 2024/05/31 for B2 records

For mc-* files from 2024/05/27 - 2024/06/19 (I will update on the re-verification of the files for each container below):

dates			container name	B2 List		S3 List
2024/05/27-2024/05/28	cf94b52abe5a	No files	Download
2024/05/29-2024/06/04	cefd3fdce464	No files	Download
2024/06/05		0c501ed61cf4	Download	Download
2024/06/05		446d55936e82	Download	Download
2024/06/05		cefd3fdce464	No files	Download
2024/06/06-2024/06/09	0c501ed61cf4	Download	Download
2024/06/09		7e1b47c305f1	Download	Download
2024/06/11-2024/06/20	6c55aaf9daaa	Download	Download

mc_search-000003 created 2024-06-22T01:09:36.337Z

arch prefix	Start Date	End Date	B2 List	S3 List
mccsv(2022) *	2024/06/22	2024/06/27	Download	Download
mcrss(2022) **	2024/06/22	2024/08/16	Download	Download

* Started from 2024/06/23 since the start date overlaps with the end date of mccsv(2022) on index mc_search-00002 (created 2024-03-24T01:07:19.076Z)

** Started from 2024/06/23 since the start date overlaps with the end date of mcrss(2022) on index mc_search-00002 (created 2024-03-24T01:07:19.076Z)


@m453h
Contributor

m453h commented Feb 3, 2025

Here's an update on the list requiring (re)verification [for mc-* files from 2024/05/27 - 2024/06/19]. I randomly selected one file from each container and checked the via field in the rss_entry section. It appears that all of them are RSS files from S3. I have indicated the value extracted from the via field in the table below:

dates			container name	B2 List		S3 List		via (B2)	via (S3)
2024/05/27-2024/05/28	cf94b52abe5a	No files	Download	N/A	s3://mediacloud-public/daily-rss/mc-2022-04-25.rss
2024/05/29-2024/06/04	cefd3fdce464	No files	Download	N/A	s3://mediacloud-public/daily-rss/mc-2022-04-17.rss
2024/06/05		0c501ed61cf4	Download	Download	s3://mediacloud-public/daily-rss/mc-2022-04-15.rss	s3://mediacloud-public/daily-rss/mc-2022-04-15.rss
2024/06/05		446d55936e82	Download	Download	s3://mediacloud-public/daily-rss/mc-2022-04-16.rss	s3://mediacloud-public/daily-rss/mc-2022-04-16.rss
2024/06/05		cefd3fdce464	No files	Download	N/A	s3://mediacloud-public/daily-rss/mc-2022-04-16.rss
2024/06/06-2024/06/09	0c501ed61cf4	Download	Download	s3://mediacloud-public/daily-rss/mc-2022-04-14.rss	s3://mediacloud-public/daily-rss/mc-2022-04-13.rss
2024/06/09		7e1b47c305f1	Download	Download	s3://mediacloud-public/daily-rss/mc-2022-04-12.rss	s3://mediacloud-public/daily-rss/mc-2022-04-12.rss
2024/06/11-2024/06/20	6c55aaf9daaa	Download	Download	s3://mediacloud-public/daily-rss/mc-2022-04-11.rss	s3://mediacloud-public/daily-rss/mc-2022-04-02.rss

@philbudne
Contributor Author

@m453h are the files available in the angwin cluster to look at?

@m453h
Contributor

m453h commented Feb 24, 2025

@m453h are the files available in the angwin cluster to look at?

@philbudne I have just uploaded the files to the angwin cluster. You can find them in my home folder under the name delete_list.tar.gz
