Merge pull request #331 from flairNLP/add_cc_news

Add support for CC-NEWS dataset

MaxDall authored Jan 30, 2024
2 parents 76b8e8b + 899c4c9 commit 358d229
Showing 16 changed files with 613 additions and 23 deletions.
33 changes: 25 additions & 8 deletions README.md
@@ -29,10 +29,11 @@ Fundus is:

* **A static news crawler.**
Fundus lets you crawl online news articles with only a few lines of Python code!
Be it from live websites or the CC-NEWS dataset.

* **An open-source Python package.**
Fundus is built on the idea of building something together. We welcome your
contribution to help Fundus [grow](docs/how_to_contribute.md)!
Fundus is built on the idea of building something together.
We welcome your contribution to help Fundus [grow](docs/how_to_contribute.md)!

<hr>

@@ -82,7 +83,7 @@ Fundus-Article:
- From: FoxNews (2023-05-09 14:37)
```

This printout tells you that you succesfully crawled two articles!
This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the "Title" of the article, i.e. its headline
@@ -96,25 +97,41 @@ For each article, the printout details:
Maybe you want to crawl a specific news source instead. Let's crawl news articles from Washington Times only:

```python

from fundus import PublisherCollection, Crawler

# initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)

# crawl 5 articles and print
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

## Example 3: Crawl articles from CC-NEWS

If you're not familiar with CC-NEWS, check out their [paper](https://paperswithcode.com/dataset/cc-news).

````python
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
````


## Tutorials

We provide **quick tutorials** to get you started with the library:

1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)
2. [**Tutorial 2: The Article Class**](docs/2_the_article_class.md)
3. [**Tutorial 3: How to filter articles**](docs/3_how_to_filter_articles.md)
4. [**Tutorial 4: How to search for publishers**](docs/4_how_to_search_for_publishers.md)
2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)
3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)
4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)
5. [**Tutorial 5: How to search for publishers**](docs/5_how_to_search_for_publishers.md)

If you wish to contribute check out these tutorials:
1. [**How to contribute**](docs/how_to_contribute.md)
3 changes: 2 additions & 1 deletion docs/1_getting_started.md
@@ -85,4 +85,5 @@ for article in crawler.crawl():
print(article)
````

In the [next section](2_the_article_class.md) we will introduce you to the `Article` class.

In the [next section](2_crawl_from_cc_news.md) we will show you how to crawl articles from the CC-NEWS dataset.
72 changes: 72 additions & 0 deletions docs/2_crawl_from_cc_news.md
@@ -0,0 +1,72 @@
# Table of Contents

* [Crawl articles from CC-NEWS](#crawl-articles-from-cc-news)
  * [The crawler](#the-crawler)
    * [OS start method](#os-start-method)
  * [Date range](#date-range)
  * [Multiprocessing](#multiprocessing)

# Crawl articles from CC-NEWS

This tutorial explains how to crawl articles from the [CC-NEWS](https://paperswithcode.com/dataset/cc-news) dataset using Fundus.

## The crawler

To crawl articles from CC-NEWS, simply import the `CCNewsCrawler` and use it just like the main Fundus crawler.
Now let's crawl some news articles from CC-NEWS using all publishers supported by the Fundus `PublisherCollection`.

````python
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)
````

### OS start method
Depending on the process [start method](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) used by your OS, you may have to wrap the crawl in an `if __name__ == "__main__":` block.

````python
from fundus import CCNewsCrawler, PublisherCollection

if __name__ == "__main__":
    crawler = CCNewsCrawler(*PublisherCollection)
    for article in crawler.crawl(max_articles=100):
        print(article)
````

This code will crawl 100 random articles from the entire date range of the CC-NEWS dataset.

## Date range

A date range, you may ask?
Yes, you can specify a date range corresponding to the date an article was added to CC-NEWS.
Let's crawl some articles that were added between 2020/01/01 and 2020/03/01.

````python
from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(start=datetime(2020, 1, 1), end=datetime(2020, 3, 1), max_articles=100):
    print(article)
````

## Multiprocessing

The CC-NEWS dataset consists of multiple terabytes of articles.
Due to the sheer amount of data, the crawler utilizes multiple processes.
By default, it uses all CPUs available in your system.
You can alter the number of additional processes used for crawling with the `processes` parameter of `CCNewsCrawler`.

````python
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection, processes=4)
````

To disable multiprocessing, pass `0` to the `processes` parameter.
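
For example, a minimal sketch of a single-process crawl, using the same `CCNewsCrawler` interface as above:

````python
from fundus import CCNewsCrawler, PublisherCollection

# processes=0 disables multiprocessing, as described above
crawler = CCNewsCrawler(*PublisherCollection, processes=0)

for article in crawler.crawl(max_articles=10):
    print(article)
````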

In the [next section](3_the_article_class.md) we will introduce you to the `Article` class.

4 changes: 2 additions & 2 deletions docs/2_the_article_class.md → docs/3_the_article_class.md
@@ -45,7 +45,7 @@ You can find those attributes under the [**supported publisher**](supported_publ

Sometimes an attribute listed in the attribute guidelines isn't supported at all by a specific parser.
You can find this information under the `Missing Attributes` tab within the supported publisher tables.
There is also a built-in search mechanic you can learn about [here](4_how_to_search_for_publishers.md)
There is also a built-in search mechanic you can learn about [here](5_how_to_search_for_publishers.md)

## The articles' body

@@ -137,4 +137,4 @@ Should print this:
en
``

In the [**next section**](3_how_to_filter_articles.md) we will show you how to filter articles.
In the [**next section**](4_how_to_filter_articles.md) we will show you how to filter articles.
@@ -196,4 +196,4 @@ crawler = Crawler(PublisherCollection.us, restrict_sources_to=[NewsMap])
The `crawl()` method supports functionality to filter out articles with URLs previously encountered in this run.
You can alter this behavior by setting the `only_unique` parameter.
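
As a hedged illustration, the snippet below passes `only_unique` explicitly; it assumes that `only_unique=False` keeps previously seen URLs, which this excerpt does not spell out:

````python
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

# only_unique=False is assumed to disable the URL deduplication described above
for article in crawler.crawl(max_articles=10, only_unique=False):
    print(article)
````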

In the [next section](4_how_to_search_for_publishers.md) we will show you how to search through publishers in the `PublisherCollection`.
In the [next section](5_how_to_search_for_publishers.md) we will show you how to search through publishers in the `PublisherCollection`.
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/how_to_add_a_publisher.md
@@ -92,7 +92,7 @@ Fundus provides the following types of `URLSource`, which you can import from `f

Fundus distinguishes between these source types to facilitate crawling only recent articles (`RSSFeed`, `NewsMap`) or an entire website (`Sitemap`).
This differentiation is mainly for efficiency reasons.
Refer to [this](3_how_to_filter_articles.md#filter-sources) documentation on how to filter for different source types.
Refer to [this](4_how_to_filter_articles.md#filter-sources) documentation on how to filter for different source types.
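
To make the distinction concrete, here is a hedged sketch that restricts a crawl to the recent-articles source types; it extends the `restrict_sources_to=[NewsMap]` usage shown elsewhere in this diff, and the publisher choice and article count are arbitrary:

````python
from fundus import Crawler, NewsMap, PublisherCollection, RSSFeed

# crawl only the "recent articles" source types (RSSFeed, NewsMap), skipping Sitemaps
crawler = Crawler(PublisherCollection.us, restrict_sources_to=[RSSFeed, NewsMap])

for article in crawler.crawl(max_articles=5):
    print(article)
````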

**_NOTE:_** When adding a new publisher, it is recommended to specify at least one `Sitemap` and one `RSSFeed` or `NewsMap` (preferred).
If your publisher provides a `NewsFeed`, there is no need to specify an `RSSFeed`.
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -34,6 +34,11 @@ dependencies = [
"aiohttp~=3.8.4",
"aioitertools~=0.11.0",
"validators~=0.20.0",
"requests~=2.28.2",
"tqdm~=4.66.1",
"fastwarc~=0.14.5",
"chardet~=5.2.0",
"dill~=0.3.7"
]

[project.urls]
12 changes: 11 additions & 1 deletion src/fundus/__init__.py
@@ -2,14 +2,24 @@
import sys

from fundus.publishers import PublisherCollection
from fundus.scraping.common_crawl import CCNewsCrawler
from fundus.scraping.filter import Requires
from fundus.scraping.html import NewsMap, RSSFeed, Sitemap
from fundus.scraping.pipeline import BaseCrawler, Crawler

__module_path__ = pathlib.Path(__file__).parent
__development_base_path__ = __module_path__.parents[1]

__all__ = ["Crawler", "BaseCrawler", "PublisherCollection", "Requires", "RSSFeed", "Sitemap", "NewsMap"]
__all__ = [
"Crawler",
"BaseCrawler",
"CCNewsCrawler",
"PublisherCollection",
"Requires",
"RSSFeed",
"Sitemap",
"NewsMap",
]

# On Windows machines, when executing `BaseCrawler.crawl` from our sync API two times,
# Python throws a `RuntimeError: Event loop is closed` exception during Python's clean-up phase.
7 changes: 4 additions & 3 deletions src/fundus/publishers/base_objects.py
@@ -5,7 +5,7 @@

from fundus.parser.base_parser import ParserProxy
from fundus.scraping.filter import URLFilter
from fundus.scraping.html import HTMLSource, NewsMap, RSSFeed, Sitemap, URLSource
from fundus.scraping.html import FundusSource, NewsMap, RSSFeed, Sitemap, URLSource
from fundus.utils.iteration import iterate_all_subclasses


@@ -33,10 +33,11 @@ def __init__(self, spec: PublisherSpec):
        self.domain = spec.domain
        self.parser = spec.parser()
        self.publisher_name = spec.name
        self.url_filter = spec.url_filter

        # we define the dict here manually instead of using default dict so that we can control
        # the order in which sources are proceeded.
        source_mapping: Dict[Type[URLSource], List[HTMLSource]] = {
        source_mapping: Dict[Type[URLSource], List[FundusSource]] = {
            RSSFeed: [],
            NewsMap: [],
            Sitemap: [],
@@ -48,7 +49,7 @@ def __init__(self, spec: PublisherSpec):
f"Unexpected type '{type(url_source).__name__}' as source for {self.name}. "
f"Allowed are '{', '.join(cls.__name__ for cls in iterate_all_subclasses(URLSource))}'"
)
source: HTMLSource = HTMLSource(
source: FundusSource = FundusSource(
url_source=url_source,
publisher=self.publisher_name,
url_filter=spec.url_filter,
3 changes: 3 additions & 0 deletions src/fundus/scraping/common_crawl/__init__.py
@@ -0,0 +1,3 @@
from .pipeline import CCNewsCrawler

__all__ = ["CCNewsCrawler"]
92 changes: 92 additions & 0 deletions src/fundus/scraping/common_crawl/html.py
@@ -0,0 +1,92 @@
from typing import Dict, Iterator, Optional
from urllib.parse import urlparse

import chardet
import requests
from fastwarc import ArchiveIterator, WarcRecord, WarcRecordType

from fundus.logging import basic_logger
from fundus.publishers.base_objects import PublisherEnum
from fundus.scraping.filter import URLFilter
from fundus.scraping.html import HTML, WarcSource, _default_header


class CCNewsSource:
    def __init__(self, *publishers: PublisherEnum, warc_path: str, headers: Optional[Dict[str, str]] = None):
        self.publishers = publishers
        self.warc_path = warc_path
        self.headers = headers or _default_header

        self._publisher_mapping: Dict[str, PublisherEnum] = {
            urlparse(publisher.domain).netloc: publisher for publisher in publishers
        }

    def fetch(self, url_filter: Optional[URLFilter] = None) -> Iterator[HTML]:
        def extract_content(record: WarcRecord) -> Optional[str]:
            warc_body: bytes = record.reader.read()

            try:
                return str(warc_body, encoding=record.http_charset)
            except (UnicodeDecodeError, TypeError):
                encoding: Optional[str] = chardet.detect(warc_body)["encoding"]

                if encoding is not None:
                    basic_logger.debug(
                        f"Trying to decode record {record.record_id!r} from {target_url!r} "
                        f"using detected encoding {encoding}."
                    )

                    try:
                        return str(warc_body, encoding=encoding)
                    except UnicodeDecodeError:
                        basic_logger.warning(
                            f"Couldn't decode record {record.record_id!r} from {target_url!r} with "
                            f"original charset {record.http_charset!r} using detected charset {encoding!r}."
                        )
                else:
                    basic_logger.warning(
                        f"Couldn't detect charset for record {record.record_id!r} from {target_url!r} "
                        f"with invalid original charset {record.http_charset!r}."
                    )

            return None

        with requests.Session() as session:
            stream = session.get(self.warc_path, stream=True, headers=self.headers).raw

            for warc_record in ArchiveIterator(stream, record_types=WarcRecordType.response, verify_digests=True):
                target_url = str(warc_record.headers["WARC-Target-URI"])

                if url_filter is not None and url_filter(target_url):
                    basic_logger.debug(f"Skipped WARC record with target URI {target_url!r} because of URL filter")
                    continue

                publisher_domain: str = urlparse(target_url).netloc

                if publisher_domain not in self._publisher_mapping:
                    continue

                publisher = self._publisher_mapping[publisher_domain]

                if publisher.url_filter is not None and publisher.url_filter(target_url):
                    basic_logger.debug(
                        f"Skipped WARC record with target URI {target_url!r} because of "
                        f"publisher specific URL filter"
                    )
                    continue

                if (content := extract_content(warc_record)) is None:
                    continue

                yield HTML(
                    requested_url=target_url,
                    responded_url=target_url,
                    content=content,
                    crawl_date=warc_record.record_date,
                    source=WarcSource(
                        publisher=publisher.publisher_name,
                        warc_path=self.warc_path,
                        warc_headers=dict(warc_record.headers),
                        http_headers=dict(warc_record.http_headers),
                    ),
                )
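
For orientation, here is a hedged usage sketch of the new source class (not part of the commit); the WARC path below is a placeholder rather than a real CC-NEWS file:

````python
from fundus import PublisherCollection
from fundus.scraping.common_crawl.html import CCNewsSource

# placeholder path; a real crawl would point at an actual CC-NEWS WARC file
source = CCNewsSource(
    *PublisherCollection.us,
    warc_path="https://data.commoncrawl.org/crawl-data/CC-NEWS/...",
)

for html in source.fetch():
    print(html.requested_url)
````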