Merge pull request #331 from flairNLP/add_cc_news

Add support for CC-NEWS dataset

MaxDall authored Jan 30, 2024
2 parents 76b8e8b + 899c4c9 commit 358d229
Showing 16 changed files with 613 additions and 23 deletions.
33 changes: 25 additions & 8 deletions README.md
@@ -29,10 +29,11 @@ Fundus is:

* **A static news crawler.**
Fundus lets you crawl online news articles with only a few lines of Python code!
Be it from live websites or the CC-NEWS dataset.

* **An open-source Python package.**
Fundus is built on the idea of building something together. We welcome your
contribution to help Fundus [grow](docs/how_to_contribute.md)!
Fundus is built on the idea of building something together.
We welcome your contribution to help Fundus [grow](docs/how_to_contribute.md)!

<hr>

@@ -82,7 +83,7 @@ Fundus-Article:
- From: FoxNews (2023-05-09 14:37)
```

This printout tells you that you succesfully crawled two articles!
This printout tells you that you successfully crawled two articles!

For each article, the printout details:
- the "Title" of the article, i.e. its headline
@@ -96,25 +97,41 @@ For each article, the printout details:
Maybe you want to crawl a specific news source instead. Let's crawl news articles from Washington Times only:

```python

from fundus import PublisherCollection, Crawler

# initialize the crawler for Washington Times
crawler = Crawler(PublisherCollection.us.WashingtonTimes)

# crawl 5 articles and print
# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

## Example 3: Crawl articles from CC-NEWS

If you're not familiar with CC-NEWS, check out their [paper](https://paperswithcode.com/dataset/cc-news).

````python
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler for news publishers based in the US
crawler = CCNewsCrawler(*PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
````


## Tutorials

We provide **quick tutorials** to get you started with the library:

1. [**Tutorial 1: How to crawl news with Fundus**](docs/1_getting_started.md)
2. [**Tutorial 2: The Article Class**](docs/2_the_article_class.md)
3. [**Tutorial 3: How to filter articles**](docs/3_how_to_filter_articles.md)
4. [**Tutorial 4: How to search for publishers**](docs/4_how_to_search_for_publishers.md)
2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)
3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)
4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)
5. [**Tutorial 5: How to search for publishers**](docs/5_how_to_search_for_publishers.md)

If you wish to contribute check out these tutorials:
1. [**How to contribute**](docs/how_to_contribute.md)
3 changes: 2 additions & 1 deletion docs/1_getting_started.md
@@ -85,4 +85,5 @@ for article in crawler.crawl():
print(article)
````

In the [next section](2_the_article_class.md) we will introduce you to the `Article` class.

In the [next section](2_crawl_from_cc_news.md) we will show you how to crawl articles from the CC-NEWS dataset.
72 changes: 72 additions & 0 deletions docs/2_crawl_from_cc_news.md
@@ -0,0 +1,72 @@
# Table of Contents

* [Crawl articles from CC-NEWS](#crawl-articles-from-cc-news)
  * [The crawler](#the-crawler)
    * [OS start method](#os-start-method)
  * [Date range](#date-range)
  * [Multiprocessing](#multiprocessing)

# Crawl articles from CC-NEWS

This tutorial explains how to crawl articles from the [CC-NEWS](https://paperswithcode.com/dataset/cc-news) dataset using Fundus.

## The crawler

To crawl articles from CC-NEWS, simply import the `CCNewsCrawler` and use it just like the main Fundus crawler.
Now let's crawl some news articles from CC-NEWS using all publishers supported by the Fundus `PublisherCollection`.

````python
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)
````

### OS start method
Depending on the process [start method](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) used by your OS, you may have to wrap the crawl in an `if __name__ == "__main__":` block.

````python
from fundus import CCNewsCrawler, PublisherCollection

if __name__ == "__main__":
    crawler = CCNewsCrawler(*PublisherCollection)
    for article in crawler.crawl(max_articles=100):
        print(article)
````

This code will crawl 100 random articles from the entire date range of the CC-NEWS dataset.

## Date range

A date range, you may ask?
Yes, you can specify a date range corresponding to the date an article was added to CC-NEWS.
Let's crawl some articles that were added between 2020/01/01 and 2020/03/01.

````python
from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(start=datetime(2020, 1, 1), end=datetime(2020, 3, 1), max_articles=100):
    print(article)
````

## Multiprocessing

The CC-NEWS dataset consists of multiple terabytes of articles.
Due to the sheer amount of data, the crawler utilizes multiple processes.
By default, it uses all CPUs available in your system.
You can alter the number of additional processes used for crawling with the `processes` parameter of `CCNewsCrawler`.

````python
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection, processes=4)
````

To disable multiprocessing, pass `0` to the `processes` parameter.
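
For example, a minimal sketch of a single-process crawl, using the same `CCNewsCrawler` interface as above:

````python
from fundus import CCNewsCrawler, PublisherCollection

# processes=0 disables multiprocessing, as described above
crawler = CCNewsCrawler(*PublisherCollection, processes=0)

for article in crawler.crawl(max_articles=10):
    print(article)
````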

In the [next section](3_the_article_class.md) we will introduce you to the `Article` class.

4 changes: 2 additions & 2 deletions docs/2_the_article_class.md → docs/3_the_article_class.md
@@ -45,7 +45,7 @@ You can find those attributes under the [**supported publisher**](supported_publ

Sometimes an attribute listed in the attribute guidelines isn't supported at all by a specific parser.
You can find this information under the `Missing Attributes` tab within the supported publisher tables.
There is also a built-in search mechanic you can learn about [here](4_how_to_search_for_publishers.md)
There is also a built-in search mechanic you can learn about [here](5_how_to_search_for_publishers.md)

## The articles' body

@@ -137,4 +137,4 @@ Should print this:
en
``

In the [**next section**](3_how_to_filter_articles.md) we will show you how to filter articles.
In the [**next section**](4_how_to_filter_articles.md) we will show you how to filter articles.
@@ -196,4 +196,4 @@ crawler = Crawler(PublisherCollection.us, restrict_sources_to=[NewsMap])
The `crawl()` method supports functionality to filter out articles with URLs previously encountered in this run.
You can alter this behavior by setting the `only_unique` parameter.
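
As a hedged illustration, the snippet below passes `only_unique` explicitly; it assumes that `only_unique=False` keeps previously seen URLs, which this excerpt does not spell out:

````python
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.us)

# only_unique=False is assumed to disable the URL deduplication described above
for article in crawler.crawl(max_articles=10, only_unique=False):
    print(article)
````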

In the [next section](4_how_to_search_for_publishers.md) we will show you how to search through publishers in the `PublisherCollection`.
In the [next section](5_how_to_search_for_publishers.md) we will show you how to search through publishers in the `PublisherCollection`.
File renamed without changes.
2 changes: 1 addition & 1 deletion docs/how_to_add_a_publisher.md
@@ -92,7 +92,7 @@ Fundus provides the following types of `URLSource`, which you can import from `f

Fundus distinguishes between these source types to facilitate crawling only recent articles (`RSSFeed`, `NewsMap`) or an entire website (`Sitemap`).
This differentiation is mainly for efficiency reasons.
Refer to [this](3_how_to_filter_articles.md#filter-sources) documentation on how to filter for different source types.
Refer to [this](4_how_to_filter_articles.md#filter-sources) documentation on how to filter for different source types.
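
To make the distinction concrete, here is a hedged sketch that restricts a crawl to the recent-articles source types; it extends the `restrict_sources_to=[NewsMap]` usage shown elsewhere in this diff, and the publisher choice and article count are arbitrary:

````python
from fundus import Crawler, NewsMap, PublisherCollection, RSSFeed

# crawl only the "recent articles" source types (RSSFeed, NewsMap), skipping Sitemaps
crawler = Crawler(PublisherCollection.us, restrict_sources_to=[RSSFeed, NewsMap])

for article in crawler.crawl(max_articles=5):
    print(article)
````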

**_NOTE:_** When adding a new publisher, it is recommended to specify at least one `Sitemap` and one `RSSFeed` or `NewsMap` (preferred).
If your publisher provides a `NewsFeed`, there is no need to specify an `RSSFeed`.
5 changes: 5 additions & 0 deletions pyproject.toml
@@ -34,6 +34,11 @@ dependencies = [
"aiohttp~=3.8.4",
"aioitertools~=0.11.0",
"validators~=0.20.0",
"requests~=2.28.2",
"tqdm~=4.66.1",
"fastwarc~=0.14.5",
"chardet~=5.2.0",
"dill~=0.3.7"
]

[project.urls]
12 changes: 11 additions & 1 deletion src/fundus/__init__.py
@@ -2,14 +2,24 @@
import sys

from fundus.publishers import PublisherCollection
from fundus.scraping.common_crawl import CCNewsCrawler
from fundus.scraping.filter import Requires
from fundus.scraping.html import NewsMap, RSSFeed, Sitemap
from fundus.scraping.pipeline import BaseCrawler, Crawler

__module_path__ = pathlib.Path(__file__).parent
__development_base_path__ = __module_path__.parents[1]

__all__ = ["Crawler", "BaseCrawler", "PublisherCollection", "Requires", "RSSFeed", "Sitemap", "NewsMap"]
__all__ = [
"Crawler",
"BaseCrawler",
"CCNewsCrawler",
"PublisherCollection",
"Requires",
"RSSFeed",
"Sitemap",
"NewsMap",
]

# On Windows machines, when executing `BaseCrawler.crawl` from our sync API two times,
# Python throws a `RuntimeError: Event loop is closed` exception during Python's clean-up phase.
7 changes: 4 additions & 3 deletions src/fundus/publishers/base_objects.py
@@ -5,7 +5,7 @@

from fundus.parser.base_parser import ParserProxy
from fundus.scraping.filter import URLFilter
from fundus.scraping.html import HTMLSource, NewsMap, RSSFeed, Sitemap, URLSource
from fundus.scraping.html import FundusSource, NewsMap, RSSFeed, Sitemap, URLSource
from fundus.utils.iteration import iterate_all_subclasses


@@ -33,10 +33,11 @@ def __init__(self, spec: PublisherSpec):
        self.domain = spec.domain
        self.parser = spec.parser()
        self.publisher_name = spec.name
        self.url_filter = spec.url_filter

        # we define the dict here manually instead of using default dict so that we can control
        # the order in which sources are proceeded.
        source_mapping: Dict[Type[URLSource], List[HTMLSource]] = {
        source_mapping: Dict[Type[URLSource], List[FundusSource]] = {
            RSSFeed: [],
            NewsMap: [],
            Sitemap: [],
@@ -48,7 +49,7 @@ def __init__(self, spec: PublisherSpec):
f"Unexpected type '{type(url_source).__name__}' as source for {self.name}. "
f"Allowed are '{', '.join(cls.__name__ for cls in iterate_all_subclasses(URLSource))}'"
)
source: HTMLSource = HTMLSource(
source: FundusSource = FundusSource(
url_source=url_source,
publisher=self.publisher_name,
url_filter=spec.url_filter,
3 changes: 3 additions & 0 deletions src/fundus/scraping/common_crawl/__init__.py
@@ -0,0 +1,3 @@
from .pipeline import CCNewsCrawler

__all__ = ["CCNewsCrawler"]
92 changes: 92 additions & 0 deletions src/fundus/scraping/common_crawl/html.py
@@ -0,0 +1,92 @@
from typing import Dict, Iterator, Optional
from urllib.parse import urlparse

import chardet
import requests
from fastwarc import ArchiveIterator, WarcRecord, WarcRecordType

from fundus.logging import basic_logger
from fundus.publishers.base_objects import PublisherEnum
from fundus.scraping.filter import URLFilter
from fundus.scraping.html import HTML, WarcSource, _default_header


class CCNewsSource:
    def __init__(self, *publishers: PublisherEnum, warc_path: str, headers: Optional[Dict[str, str]] = None):
        self.publishers = publishers
        self.warc_path = warc_path
        self.headers = headers or _default_header

        self._publisher_mapping: Dict[str, PublisherEnum] = {
            urlparse(publisher.domain).netloc: publisher for publisher in publishers
        }

    def fetch(self, url_filter: Optional[URLFilter] = None) -> Iterator[HTML]:
        def extract_content(record: WarcRecord) -> Optional[str]:
            warc_body: bytes = record.reader.read()

            try:
                return str(warc_body, encoding=record.http_charset)
            except (UnicodeDecodeError, TypeError):
                encoding: Optional[str] = chardet.detect(warc_body)["encoding"]

                if encoding is not None:
                    basic_logger.debug(
                        f"Trying to decode record {record.record_id!r} from {target_url!r} "
                        f"using detected encoding {encoding}."
                    )

                    try:
                        return str(warc_body, encoding=encoding)
                    except UnicodeDecodeError:
                        basic_logger.warning(
                            f"Couldn't decode record {record.record_id!r} from {target_url!r} with "
                            f"original charset {record.http_charset!r} using detected charset {encoding!r}."
                        )
                else:
                    basic_logger.warning(
                        f"Couldn't detect charset for record {record.record_id!r} from {target_url!r} "
                        f"with invalid original charset {record.http_charset!r}."
                    )

            return None

        with requests.Session() as session:
            stream = session.get(self.warc_path, stream=True, headers=self.headers).raw

            for warc_record in ArchiveIterator(stream, record_types=WarcRecordType.response, verify_digests=True):
                target_url = str(warc_record.headers["WARC-Target-URI"])

                if url_filter is not None and url_filter(target_url):
                    basic_logger.debug(f"Skipped WARC record with target URI {target_url!r} because of URL filter")
                    continue

                publisher_domain: str = urlparse(target_url).netloc

                if publisher_domain not in self._publisher_mapping:
                    continue

                publisher = self._publisher_mapping[publisher_domain]

                if publisher.url_filter is not None and publisher.url_filter(target_url):
                    basic_logger.debug(
                        f"Skipped WARC record with target URI {target_url!r} because of "
                        f"publisher specific URL filter"
                    )
                    continue

                if (content := extract_content(warc_record)) is None:
                    continue

                yield HTML(
                    requested_url=target_url,
                    responded_url=target_url,
                    content=content,
                    crawl_date=warc_record.record_date,
                    source=WarcSource(
                        publisher=publisher.publisher_name,
                        warc_path=self.warc_path,
                        warc_headers=dict(warc_record.headers),
                        http_headers=dict(warc_record.http_headers),
                    ),
                )
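
For orientation, here is a hedged usage sketch of the new source class (not part of the commit); the WARC path below is a placeholder rather than a real CC-NEWS file:

````python
from fundus import PublisherCollection
from fundus.scraping.common_crawl.html import CCNewsSource

# placeholder path; a real crawl would point at an actual CC-NEWS WARC file
source = CCNewsSource(
    *PublisherCollection.us,
    warc_path="https://data.commoncrawl.org/crawl-data/CC-NEWS/...",
)

for html in source.fetch():
    print(html.requested_url)
````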