Add functionality to exclude tags from extraction and normalize space #382

Merged
merged 6 commits into from
Mar 7, 2024

MaxDall
Collaborator

@MaxDall MaxDall commented Feb 27, 2024

As #338 showed (see Focus/Nation), JavaScript is sometimes extracted as text into the article. This happens because Fundus uses the broad XPath expression string() to extract text from nodes. string() concatenates the text of a node and all of its descendants, which makes extraction fast and easy, but it gives Fundus no way to skip certain tag types (like <script>) during text extraction.
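A minimal illustration of the problem, assuming lxml is installed. The HTML snippet is made up for this example and is not taken from #338.

```python
import lxml.html

# Hypothetical article fragment with an inline script between two paragraphs.
html = "<div><p>Hello</p><script>var x = 1;</script><p>world</p></div>"
node = lxml.html.fromstring(html)

# XPath string() concatenates every descendant text node,
# so the body of <script> ends up in the extracted text.
raw_text = node.xpath("string()")
print(raw_text)
```

The script body shows up between the two paragraph texts because string() has no notion of tags to skip.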

To address this issue, this PR implements:

  • a new method text_content(excluded_tags: List[str]), defined on Node, that extracts text only from non-excluded nodes down the path
  • a normalization function, normalize-space, that normalizes spacing within all extracted text. Previously, extracted text was in an inconsistent state, somewhere between raw and normalized. The normalization is done with ' '.join(text.split()), where text is the extracted string
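The two points above can be sketched roughly as follows. This is not the code from this PR: it is a free function rather than a method on Node, the helper name _collect and the sample HTML are invented for illustration, and it assumes lxml is installed.

```python
from typing import List

import lxml.html


def text_content(element: lxml.html.HtmlElement, excluded_tags: List[str]) -> str:
    """Collect text below `element`, skipping excluded tags and comments."""
    excluded = set(excluded_tags)

    def _collect(node) -> str:
        # The tail follows the node's closing tag and belongs to the
        # parent's text flow, so it is kept even when the node is skipped.
        tail = node.tail or ""
        # Comments and processing instructions have a non-string tag.
        if not isinstance(node.tag, str) or node.tag in excluded:
            return tail
        text = node.text or ""
        children = "".join(_collect(child) for child in node.iterchildren())
        return text + children + tail

    inner = (element.text or "") + "".join(
        _collect(child) for child in element.iterchildren()
    )
    # Whitespace normalization: collapse runs of whitespace, strip the ends.
    return " ".join(inner.split())


root = lxml.html.fromstring(
    "<div><p>Hello  </p><script>var x = 1;</script>\n<!-- note --><p>world</p></div>"
)
result = text_content(root, ["script"])
print(result)
```

In this sketch the script body and the comment are dropped, while whitespace that follows a skipped node is kept and later collapsed by the normalization step.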

@MaxDall MaxDall mentioned this pull request Feb 27, 2024
39 tasks
@MaxDall MaxDall added the high priority Urgent PR. label Feb 27, 2024
@MaxDall MaxDall requested a review from dobbersc February 27, 2024 14:36
text = element.text or "" if not isinstance(element, lxml.html.HtmlComment) else ""
children = "".join([_text_content(child) for child in element.iterchildren()]) or ""
tail = element.tail or ""
return text + children + tail
Collaborator

Suggested change
- return text + children + tail
+ return f"{text}{children}{tail}"

Collaborator Author

I would prefer plus concatenation over f-strings; it's just personal preference here.

Collaborator

OK, the f-string is just meant to avoid creating the intermediate string text + children.
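The point about the intermediate string can be made visible with a small tracing sketch. TracedStr is a hypothetical helper written only for this illustration; it is not part of Fundus or lxml.

```python
class TracedStr(str):
    """str subclass that records each + in which it is the left operand."""

    log = []

    def __add__(self, other):
        result = TracedStr(str.__add__(self, other))
        TracedStr.log.append(str(result))
        return result


text, children, tail = TracedStr("a"), "b", "c"

# a + b + c is evaluated as (a + b) + c, so "ab" is built first.
plus_result = text + children + tail
plus_steps = list(TracedStr.log)

TracedStr.log.clear()

# The f-string formats each part and joins them without calling __add__.
fstring_result = f"{text}{children}{tail}"
fstring_steps = list(TracedStr.log)

print(plus_steps, fstring_steps)
```

Both forms produce the same string; for three short strings the difference is most likely negligible in practice, which is consistent with treating the choice as a matter of preference.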

MaxDall and others added 3 commits March 1, 2024 12:36
Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>
Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>
@MaxDall MaxDall merged commit 457f9ca into master Mar 7, 2024
4 checks passed
@MaxDall MaxDall deleted the rework-text-extraction branch March 7, 2024 12:20