Add functionality to exclude tags from extraction and normalize space #382

MaxDall · 2024-02-27T14:23:48Z

As #338 showed (see Focus/Nation), JavaScript is sometimes extracted as text into the article. This happens because Fundus uses the broad XPath expression string() to extract text from nodes. While string() makes it fast and easy to collect text from all descendant nodes, it does not allow Fundus to skip tag types (like <script>) during text extraction.
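The effect is easy to reproduce. The sketch below is not Fundus code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and itertext() as a stand-in for the XPath string() call, since both concatenate every descendant text node. The markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup with an inline script (illustration only).
markup = "<article><p>Breaking news.</p><script>window.ads = [];</script></article>"
root = ET.fromstring(markup)

# Concatenating all descendant text, as string() does, has no notion of
# tag types, so the script body ends up in the extracted text.
extracted = "".join(root.itertext())
print(extracted)
```

The script body is appended directly to the paragraph text, which is the kind of leak this PR is meant to prevent.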

To address this issue, this PR implements:

a new method text_content(excluded_tags: List[str]), defined on Node, that extracts text only from non-excluded nodes down the path
a normalization function, normalize-space, that normalizes spacing within all extracted text. Previously, extracted text was in an inconsistent state between raw and normalized. The normalization is done with ' '.join(.split())
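A minimal sketch of the two pieces described above, assuming a recursive walk over text, children, and tail. It is not the merged implementation: it uses the standard library's xml.etree.ElementTree instead of lxml, writes text_content as a free function rather than a Node method, and the names normalize_space and walk are chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def normalize_space(text: str) -> str:
    # Collapse every run of whitespace to one space and trim both ends.
    return " ".join(text.split())


def text_content(element: ET.Element, excluded_tags: Iterable[str] = ()) -> str:
    excluded = set(excluded_tags)

    def walk(node: ET.Element) -> str:
        if node.tag in excluded:
            # Drop the excluded subtree, but keep its tail: the tail is
            # text that follows the element and belongs to the parent.
            return node.tail or ""
        text = node.text or ""
        children = "".join(walk(child) for child in node)
        tail = node.tail or ""
        return text + children + tail

    return walk(element)


# Hypothetical markup (illustration only).
markup = "<div><p>Hello   <b>world</b>!</p><script>var x = 1;</script>\n<p>Bye</p></div>"
root = ET.fromstring(markup)

raw = text_content(root, excluded_tags=["script"])
clean = normalize_space(raw)
print(clean)
```

Keeping the tail of an excluded element matters: in the ElementTree and lxml data model, text that follows a closing tag is stored on that element, so discarding it together with the subtree would silently drop article text.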

src/fundus/parser/utility.py

dobbersc · 2024-02-29T18:14:41Z

+            text = element.text or "" if not isinstance(element, lxml.html.HtmlComment) else ""
+            children = "".join([_text_content(child) for child in element.iterchildren()]) or ""
+            tail = element.tail or ""
+            return text + children + tail


Suggested change

-            return text + children + tail
+            return f"{text}{children}{tail}"

MaxDall · 2024-03-04

I would prefer plus concatenation over f-strings here; it's just personal preference.

dobbersc · 2024-03-04

OK. The f-string was only meant to avoid creating the intermediate string text + children.
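For reference, a small sketch comparing the two variants discussed here. Both return the same string; the difference is that chained `+` is evaluated left to right, so `text + children` is built first and then joined with `tail`. The function names are hypothetical, and the bytecode listing depends on the interpreter version, so it is collected without asserting a specific result.

```python
import dis


def join_plus(text: str, children: str, tail: str) -> str:
    # Evaluated as (text + children) + tail.
    return text + children + tail


def join_fstring(text: str, children: str, tail: str) -> str:
    # Formats all three parts into one result string.
    return f"{text}{children}{tail}"


parts = ("Hello ", "world", "!")
assert join_plus(*parts) == join_fstring(*parts)

# Opcode names for manual inspection; the exact names vary between
# CPython versions.
plus_ops = [ins.opname for ins in dis.get_instructions(join_plus)]
fstring_ops = [ins.opname for ins in dis.get_instructions(join_fstring)]
print(plus_ops)
print(fstring_ops)
```

For three short strings the difference is unlikely to be measurable, which fits the thread's conclusion that the choice is a matter of preference.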

MaxDall added 3 commits February 27, 2024 14:52

add functionality to exclude tags from extraction and normalize space

fbcde65

remove leftover strip

95ca9ac

fix mypy

46167a5

MaxDall mentioned this pull request Feb 27, 2024

Quality control for parser test cases. #354

Closed

39 tasks

MaxDall added the "high priority" label ("Urgent PR.") Feb 27, 2024

MaxDall requested a review from dobbersc February 27, 2024 14:36

dobbersc requested changes Feb 29, 2024

MaxDall and others added 3 commits March 1, 2024 12:36

Update src/fundus/parser/utility.py

5081a74

Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>

Apply suggestions from code review

15d3311

Co-authored-by: Conrad Dobberstein <29147025+dobbersc@users.noreply.github.com>

black

108f05a

dobbersc approved these changes Mar 4, 2024

MaxDall merged commit 457f9ca into master Mar 7, 2024
4 checks passed

MaxDall deleted the rework-text-extraction branch March 7, 2024 12:20

MaxDall mentioned this pull request Apr 24, 2024

[Feature Request]: Quality check for extracted articles (javascript code in articles) #275

Closed


MaxDall commented Feb 27, 2024

dobbersc Feb 29, 2024

MaxDall Mar 4, 2024

dobbersc Mar 4, 2024
