This repository was archived by the owner on Nov 28, 2021. It is now read-only.

Commit 60d0f53

wordstat: Handle "invalid" UTF-8.
`pycld` is fussy when it comes to UTF-8 (see mikemccand/chromium-compact-language-detector#22 and aboSamoor/polyglot#71). This strips out the non-printable characters that make `cld` choke. Thanks to @andreoua for the suggested fix.
1 parent 3638d0b commit 60d0f53

1 file changed: +10 -1 lines changed


hadsh/wordstat.py

@@ -5,14 +5,23 @@
 from string import punctuation
 
 
+def stripunprintable(s):
+    """
+    Strip non-printable characters
+    """
+    return ''.join(c for c in s if c.isprintable())
+
+
 def tokenise(html_text):
     """
     Return a list of words that appear in the text.
     """
     try:
         return list(
             filter(lambda w : w not in punctuation,
-                Text(html_to_text(html_text)).lower().words))
+                Text(stripunprintable(
+                    html_to_text(html_text))
+                    ).lower().words))
     except ValueError:
         # Empty sequence?
         return []
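
To illustrate what the new helper does to its input, here is a small standalone sketch. The body of `stripunprintable` is copied from the diff. The sample string and the `strip_unprintable_keep_ws` variant are hypothetical and are not part of this commit. Note that `str.isprintable()` returns False for tab and newline, so the helper removes those as well, which can join words that were separated only by a line break.

```python
def stripunprintable(s):
    """Strip non-printable characters (same body as in the commit)."""
    return ''.join(c for c in s if c.isprintable())


def strip_unprintable_keep_ws(s):
    """Hypothetical variant, not in the commit: also keep whitespace
    such as tabs and newlines, which isprintable() rejects."""
    return ''.join(c for c in s if c.isprintable() or c.isspace())


# Made-up sample: a NUL byte, a Unicode noncharacter (U+FFFE),
# a tab and a trailing newline mixed into ordinary text.
sample = "caf\u00e9\x00 au\ufffe lait\tnow\n"

cleaned = stripunprintable(sample)
cleaned_keep_ws = strip_unprintable_keep_ws(sample)

print(repr(cleaned))
print(repr(cleaned_keep_ws))
```

With the committed helper, the tab between "lait" and "now" is dropped along with the control characters, so the two words run together. Whether that matters for `tokenise` depends on how `html_to_text` separates words, which this page does not show.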
