Add free access attribute #362

Merged
merged 16 commits into from
Feb 24, 2024
8 changes: 8 additions & 0 deletions docs/attribute_guidelines.md
@@ -58,4 +58,12 @@ Those attributes will be validated with unit tests when used.
<td><code>List[str]</code></td>
<td><code>generic_topic_parsing</code></td>
</tr>
<tr>
<td>free_access</td>
<td>A boolean that is set to False if the article is restricted to users with a subscription. This usually indicates
that the article cannot be crawled completely.
<i><b>This attribute is implemented by default</b></i></td>
<td><code>bool</code></td>
<td><code></code></td>
</tr>
</table>
18 changes: 18 additions & 0 deletions docs/how_to_add_a_publisher.md
@@ -16,6 +16,7 @@
* [Working with `lxml`](#working-with-lxml)
* [CSS-Select](#css-select)
* [XPath](#xpath)
* [Checking the free_access attribute](#checking-the-free_access-attribute)
* [Finishing the Parser](#finishing-the-parser)
* [6. Generate unit tests](#6-generate-unit-tests)
* [7. Opening a Pull Request](#7-opening-a-pull-request)
@@ -469,6 +470,23 @@ Instead, we recommend referring to [this](https://devhints.io/xpath) documentati
Make sure to examine other parsers and consult the [attribute guidelines](attribute_guidelines.md) for specifics on attribute implementation.
We strongly encourage utilizing these utility functions, especially when parsing the `ArticleBody`.

### Checking the free_access attribute

If your new publisher does not have a subscription model, you can skip this step.
If it does, please verify that the source code of premium articles contains an `isAccessibleForFree` tag within the HTML's `ld+json` elements (refer to the section [Extracting attributes from Precomputed](#extracting-attributes-from-precomputed) for details) and that this tag is set to `false` or `False` (or to `true`/`True` for freely accessible articles).
It doesn't matter if the tag is missing in freely accessible articles.
If this is the case, you can continue with the next step. If not, please overwrite the existing function by adding the following snippet to your parser:

```python
@attribute
def free_access(self) -> bool:
    # Your personalized logic goes here
    ...
```

Usually you can identify a premium article by an indicator within the URL, or by using XPath or CSSSelector to select
the element that asks the reader to purchase a subscription to view the article.
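
The element-based check could be sketched as follows. This is a minimal, standalone illustration rather than a Fundus parser: the function name `is_free_access` and the `paywall` class name are assumptions, and the real selector depends on the publisher's markup.

```python
import lxml.html
from lxml.etree import XPath

# Hypothetical selector: the class name "paywall" is an assumption for illustration.
_paywall_selector = XPath("//div[contains(@class, 'paywall')]")


def is_free_access(html: str) -> bool:
    """Return False if the document contains an element matching the paywall selector."""
    root = lxml.html.fromstring(html)
    # The compiled XPath returns a list of matches; an empty list means no paywall element.
    return not _paywall_selector(root)


premium = "<html><body><div class='paywall'>Subscribe to read on</div></body></html>"
free = "<html><body><p>Full article text</p></body></html>"

print(is_free_access(premium))
print(is_free_access(free))
```

Inside a parser, the same logic would presumably sit in the body of the overwritten `free_access` attribute and operate on the already parsed document.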

### Finishing the Parser

Bringing all the above together, the Los Angeles Times now looks like this.
4 changes: 1 addition & 3 deletions docs/supported_publishers.md
@@ -91,9 +91,7 @@
</a>
</td>
<td>&#160;</td>
<td>
<code>free_access</code>
</td>
<td>&#160;</td>
</tr>
<tr>
<td>
9 changes: 9 additions & 0 deletions src/fundus/parser/base_parser.py
@@ -233,6 +233,15 @@ def __meta(self) -> Dict[str, Any]:
    def __ld(self) -> Optional[LinkedDataMapping]:
        return self.precomputed.ld

    @attribute
    def free_access(self) -> bool:
        if (isAccessibleForFree := self.precomputed.ld.bf_search("isAccessibleForFree")) is None:
            return True
        elif not isAccessibleForFree or isAccessibleForFree == "false" or isAccessibleForFree == "False":
            return False
        else:
            return True


class _ParserCache:
def __init__(self, factory: Type[BaseParser]):
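
To make the default behavior in this hunk easier to reason about, here is a standalone sketch of the same decision logic. The function name `free_access_from_ld` is hypothetical, and the value is passed in directly instead of being looked up via `bf_search`.

```python
from typing import Any, Optional


def free_access_from_ld(value: Optional[Any]) -> bool:
    """Mirror the default logic: a missing tag counts as freely accessible."""
    if value is None:
        return True
    # A falsy value (e.g. JSON false) or the strings "false"/"False" mark a restricted article.
    if not value or value in ("false", "False"):
        return False
    return True


for candidate in (None, False, "false", "False", True, "true"):
    print(repr(candidate), free_access_from_ld(candidate))
```

Note that any other truthy value, including an unexpected string, is treated as freely accessible under this logic.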
8 changes: 8 additions & 0 deletions src/fundus/publishers/de/bild.py
@@ -1,4 +1,5 @@
import datetime
import re
from typing import List, Optional

from lxml.etree import XPath
@@ -42,3 +43,10 @@ def title(self) -> Optional[str]:
    @attribute
    def topics(self) -> List[str]:
        return generic_topic_parsing(self.precomputed.meta.get("keywords"))

    @attribute
    def free_access(self) -> bool:
        if (url := self.precomputed.meta.get("og:url")) is not None:
            return re.search(r"/bild-plus/", url) is None
        else:
            return True
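
The URL-based check in this hunk can be tried in isolation with a sketch like the one below. The function name `free_access_from_url` and the example URLs are made up for illustration; only the `/bild-plus/` path marker comes from the diff.

```python
import re
from typing import Optional

# Path marker taken from the Bild parser in this diff.
_PREMIUM_PATH = re.compile(r"/bild-plus/")


def free_access_from_url(url: Optional[str]) -> bool:
    """Treat an article as freely accessible unless its URL contains the premium marker."""
    if url is None:
        # Without a URL there is no indicator, so fall back to "free".
        return True
    return _PREMIUM_PATH.search(url) is None


print(free_access_from_url("https://www.example.com/bild-plus/politik/article.html"))
print(free_access_from_url("https://www.example.com/politik/article.html"))
print(free_access_from_url(None))
```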
4 changes: 0 additions & 4 deletions src/fundus/publishers/de/braunschweiger_zeitung.py
@@ -56,7 +56,3 @@ def authors(self) -> List[str]:
    @attribute
    def publishing_date(self) -> Optional[datetime.datetime]:
        return generic_date_parsing(self.precomputed.ld.bf_search("datePublished"))

    @attribute(validate=False)
    def free_access(self) -> bool:
        return self.precomputed.ld.bf_search("isAccessibleForFree") == "True"
12 changes: 6 additions & 6 deletions tests/test_parser.py
@@ -47,10 +47,10 @@ def test_functions_iter(self, parser_with_function_test, parser_with_static_meth
        assert parser_with_function_test.functions().names == ["test"]

    def test_attributes_iter(self, parser_with_attr_title, parser_with_static_method):
        assert len(BaseParser.attributes()) == 0
        assert len(parser_with_static_method.attributes()) == 0
        assert len(parser_with_attr_title.attributes()) == 1
        assert parser_with_attr_title.attributes().names == ["title"]
        assert len(BaseParser.attributes()) == 1
        assert len(parser_with_static_method.attributes()) == 1
        assert len(parser_with_attr_title.attributes()) == 2
        assert parser_with_attr_title.attributes().names == ["free_access", "title"]

    def test_supported_unsupported(self):
        class ParserWithValidatedAndUnvalidated(BaseParser):
@@ -63,12 +63,12 @@ def unvalidated(self) -> str:
                return "unsupported"

        parser = ParserWithValidatedAndUnvalidated()
        assert len(parser.attributes()) == 2
        assert len(parser.attributes()) == 3

        assert (validated := parser.attributes().validated)
        assert isinstance(validated, AttributeCollection)
        assert (funcs := list(validated)) != [parser.validated]
        assert funcs[0].__func__ == parser.validated.__func__
        assert funcs[1].__func__ == parser.validated.__func__

        assert (unvalidated := parser.attributes().unvalidated)
        assert isinstance(validated, AttributeCollection)