Merge pull request #331 from flairNLP/add_cc_news
Add support for CC-NEWS dataset
Showing 16 changed files with 613 additions and 23 deletions.
@@ -0,0 +1,72 @@
# Table of Contents

* [Crawl articles from CC-NEWS](#crawl-articles-from-cc-news)
  * [The crawler](#the-crawler)
    * [OS start method](#os-start-method)
  * [Date range](#date-range)
  * [Multiprocessing](#multiprocessing)

# Crawl articles from CC-NEWS

This tutorial explains how to crawl articles from the [CC-NEWS](https://paperswithcode.com/dataset/cc-news) dataset using Fundus.

## The crawler
To crawl articles from CC-NEWS, simply import the `CCNewsCrawler` and stick to the same schema as the main Fundus crawler.
Now let's crawl a bunch of news articles from CC-NEWS using all publishers available in the Fundus `PublisherCollection`.

````python
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(max_articles=100):
    print(article)
````

### OS start method

Depending on the process [start method](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) used by your OS, you may have to wrap this crawl with a `__name__ == "__main__"` block.

````python
from fundus import CCNewsCrawler, PublisherCollection

if __name__ == "__main__":
    crawler = CCNewsCrawler(*PublisherCollection)
    for article in crawler.crawl(max_articles=100):
        print(article)
````

This code will crawl 100 random articles from the entire date range of the CC-NEWS dataset.

## Date range

Date range, you may ask?
Yes, you can specify a date range corresponding to the date the article was added to CC-NEWS.
Let's crawl some articles that were crawled between 2020/01/01 and 2020/03/01.

````python
from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
for article in crawler.crawl(start=datetime(2020, 1, 1), end=datetime(2020, 3, 1), max_articles=100):
    print(article)
````
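
Both bounds appear to be optional, since the first example above passes neither.
Assuming that, a sketch of bounding only one side of the range, e.g. everything added since a given date, might look like this:

````python
from datetime import datetime

from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection)
# Only a lower bound: crawl articles added to CC-NEWS on or after 2022-01-01.
for article in crawler.crawl(start=datetime(2022, 1, 1), max_articles=100):
    print(article)
````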

## Multiprocessing

The CC-NEWS dataset consists of multiple terabytes of articles.
Due to the sheer amount of data, the crawler utilizes multiple processes.
By default, it uses all CPUs available in your system.
You can alter the number of additional processes used for crawling with the `processes` parameter of `CCNewsCrawler`.

````python
from fundus import CCNewsCrawler, PublisherCollection

crawler = CCNewsCrawler(*PublisherCollection, processes=4)
````

To disable multiprocessing, pass `0` to the `processes` parameter.
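For example, a crawl that runs entirely in the current process:

````python
from fundus import CCNewsCrawler, PublisherCollection

# processes=0 disables multiprocessing, which can make debugging easier.
crawler = CCNewsCrawler(*PublisherCollection, processes=0)
for article in crawler.crawl(max_articles=10):
    print(article)
````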

In the [next section](3_the_article_class) we will introduce you to the `Article` class.
File renamed without changes.
@@ -0,0 +1,3 @@
from .pipeline import CCNewsCrawler

__all__ = ["CCNewsCrawler"]
@@ -0,0 +1,92 @@
from typing import Dict, Iterator, Optional
from urllib.parse import urlparse

import chardet
import requests
from fastwarc import ArchiveIterator, WarcRecord, WarcRecordType

from fundus.logging import basic_logger
from fundus.publishers.base_objects import PublisherEnum
from fundus.scraping.filter import URLFilter
from fundus.scraping.html import HTML, WarcSource, _default_header


class CCNewsSource:
    def __init__(self, *publishers: PublisherEnum, warc_path: str, headers: Optional[Dict[str, str]] = None):
        self.publishers = publishers
        self.warc_path = warc_path
        self.headers = headers or _default_header

        # Map each publisher's domain (netloc) to its enum member so that WARC
        # records can be matched to publishers via their target URL.
        self._publisher_mapping: Dict[str, PublisherEnum] = {
            urlparse(publisher.domain).netloc: publisher for publisher in publishers
        }

    def fetch(self, url_filter: Optional[URLFilter] = None) -> Iterator[HTML]:
        def extract_content(record: WarcRecord) -> Optional[str]:
            # Decode the record body with the declared HTTP charset, falling back
            # to chardet detection if that charset is missing or wrong.
            warc_body: bytes = record.reader.read()

            try:
                return str(warc_body, encoding=record.http_charset)
            except (UnicodeDecodeError, TypeError):
                encoding: Optional[str] = chardet.detect(warc_body)["encoding"]

                if encoding is not None:
                    basic_logger.debug(
                        f"Trying to decode record {record.record_id!r} from {target_url!r} "
                        f"using detected encoding {encoding}."
                    )

                    try:
                        return str(warc_body, encoding=encoding)
                    except UnicodeDecodeError:
                        basic_logger.warning(
                            f"Couldn't decode record {record.record_id!r} from {target_url!r} with "
                            f"original charset {record.http_charset!r} using detected charset {encoding!r}."
                        )
                else:
                    basic_logger.warning(
                        f"Couldn't detect charset for record {record.record_id!r} from {target_url!r} "
                        f"with invalid original charset {record.http_charset!r}."
                    )

                return None

        # Stream the WARC file over HTTP and iterate over its response records.
        with requests.Session() as session:
            stream = session.get(self.warc_path, stream=True, headers=self.headers).raw

            for warc_record in ArchiveIterator(stream, record_types=WarcRecordType.response, verify_digests=True):
                target_url = str(warc_record.headers["WARC-Target-URI"])

                if url_filter is not None and url_filter(target_url):
                    basic_logger.debug(f"Skipped WARC record with target URI {target_url!r} because of URL filter")
                    continue

                publisher_domain: str = urlparse(target_url).netloc

                if publisher_domain not in self._publisher_mapping:
                    continue

                publisher = self._publisher_mapping[publisher_domain]

                if publisher.url_filter is not None and publisher.url_filter(target_url):
                    basic_logger.debug(
                        f"Skipped WARC record with target URI {target_url!r} because of "
                        f"publisher specific URL filter"
                    )
                    continue

                if (content := extract_content(warc_record)) is None:
                    continue

                yield HTML(
                    requested_url=target_url,
                    responded_url=target_url,
                    content=content,
                    crawl_date=warc_record.record_date,
                    source=WarcSource(
                        publisher=publisher.publisher_name,
                        warc_path=self.warc_path,
                        warc_headers=dict(warc_record.headers),
                        http_headers=dict(warc_record.http_headers),
                    ),
                )
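
For orientation, here is a minimal usage sketch of this source. The WARC path below is a hypothetical placeholder (in practice, the crawler resolves the CC-NEWS WARC paths itself), and `CCNewsSource` is assumed to be importable from this module.

````python
from fundus import PublisherCollection

# Hypothetical WARC path; real CC-NEWS paths are resolved by CCNewsCrawler.
warc_path = "https://data.commoncrawl.org/crawl-data/CC-NEWS/2020/01/CC-NEWS-20200101000000-00000.warc.gz"

# CCNewsSource as defined above, matching records against all known publishers.
source = CCNewsSource(*PublisherCollection, warc_path=warc_path)

for html in source.fetch():
    print(html.requested_url, html.source.publisher)
````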