Add functionality to exclude tags from extraction and normalize space #382
text = element.text or "" if not isinstance(element, lxml.html.HtmlComment) else ""
children = "".join([_text_content(child) for child in element.iterchildren()]) or ""
tail = element.tail or ""
return text + children + tail
- return text + children + tail
+ return f"{text}{children}{tail}"
I would prefer the plus concatenation over f-strings; it's just personal preference here.
OK, the f-string is just to avoid creating the intermediate string text + children.
Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>
As #338 showed (see Focus/Nation), JS is sometimes extracted as text into the article. This is due to Fundus using a broad XPath expression, string(), to extract text from nodes. While this makes it fast and easy to extract text from subsequent nodes, it does not allow Fundus to skip tag types (like <script>) during text extraction.

To address this issue, this PR implements:
- text_content(excluded_tags: List[str]), defined on Node, to extract text only from non-excluded nodes down the path
- normalize-space, to normalize spacing within all extracted text. Previously, extracted text was in a weird state between raw and normalized. The normalization is done with ' '.join(.split())
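
The two changes above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: it uses the standard library's xml.etree.ElementTree instead of lxml so it runs without extra dependencies, and the standalone functions text_content and normalize_space are hypothetical stand-ins for the method on Node and the normalization step. The sketch also assumes that the tail of an excluded element should be kept, since the tail belongs to the parent's text flow.

```python
import xml.etree.ElementTree as ET
from typing import Collection


def text_content(element: ET.Element, excluded_tags: Collection[str] = ()) -> str:
    """Concatenate text, child text, and tail, skipping excluded tags.

    Hypothetical stand-in for the PR's method; uses ElementTree, not lxml.
    """
    tail = element.tail or ""
    if element.tag in excluded_tags:
        # Assumption: drop the excluded element's own content, keep its tail.
        return tail
    text = element.text or ""
    children = "".join(text_content(child, excluded_tags) for child in element)
    return text + children + tail


def normalize_space(text: str) -> str:
    """Collapse every run of whitespace into one space and trim both ends."""
    return " ".join(text.split())


markup = "<div><p>Hello   world</p>\n<script>var x = 1;</script>\n<p>Bye</p></div>"
root = ET.fromstring(markup)

raw = text_content(root, excluded_tags=["script"])
print(repr(raw))             # 'Hello   world\n\nBye'
print(normalize_space(raw))  # Hello world Bye
```

Without excluded_tags, the same call would include the script body ("var x = 1;") in the result, which is the behaviour reported in #338.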