Error on language detection for some unicode characters (control characters) #71
To find them all:

```python
>>> from polyglot.text import Sentence
>>> bads = set()
>>> for i in range(10000):
...     try:
...         Sentence("try %s it" % chr(i)).words
...     except:
...         bads.add(i)
...
>>> ", ".join(chr(i) for i in sorted(bads))
'\x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x0b, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15, \x16, \x17, \x18, \x19, \x1a, \x1b, \x1c, \x1d, \x1e, \x1f, \x7f, \x80, \x81, \x82, \x83, \x84, \x85, \x86, \x87, \x88, \x89, \x8a, \x8b, \x8c, \x8d, \x8e, \x8f, \x90, \x91, \x92, \x93, \x94, \x95, \x96, \x97, \x98, \x99, \x9a, \x9b, \x9c, \x9d, \x9e, \x9f'
```

My proposition would be either to fix cld2 (if that is possible) or to simply remove those characters from the sentence before submitting it for detection.
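A minimal sketch of that second option, assuming the offending codepoints are exactly the C0/C1 controls found above (tab, newline, form feed and carriage return are absent from the list, so they are kept); `strip_control_chars` is a hypothetical helper, not part of polyglot:

```python
# Codepoints that crashed detection above: C0 controls (minus \t, \n, \x0c, \r),
# DEL, and the C1 control block.
BAD_CODEPOINTS = [i for i in range(0x20) if chr(i) not in "\t\n\x0c\r"]
BAD_CODEPOINTS += list(range(0x7F, 0xA0))
STRIP_TABLE = dict.fromkeys(BAD_CODEPOINTS)  # maps each codepoint to None

def strip_control_chars(text):
    """Remove the control characters that make cld2 choke."""
    return text.translate(STRIP_TABLE)

strip_control_chars("try \x96 it")  # 'try  it'
```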
To bypass cld2, you can also instantiate …
Hello, how do I catch this error? What is the exception I should catch? Thanks
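In case it helps, a minimal sketch of catching the failure; this assumes (as later comments in this thread suggest) that the error comes from the pycld2 bindings, which expose a `pycld2.error` exception. Verify against your installed version:

```python
import pycld2
from polyglot.text import Text

def safe_words(text):
    """Return the tokenized words, or None when cld2 rejects the input."""
    try:
        return Text(text).words
    except pycld2.error:
        return None

safe_words("try \x96 it")  # None instead of an uncaught exception
```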
Correct me if I'm wrong, but this still seems to be a problem.
For the moment, on my side, I simply filter out bad characters before submission…
I believe this is the underlying issue in cld2: mikemccand/chromium-compact-language-detector#22
I've used the following command to replace control characters in my dataset, using the list of characters provided by @alexgarel above: …
Posting it here in case it's useful for anyone else hitting this problem, but I'm not convinced that the list of characters above is complete, as I still have issues with some files.
@jamesdbaker you might want to add the 'g' switch for multiple substitutions.
@andreoua provided a nice, succinct workaround for this.
This won't work for Python 2.7 users, but for those of us who have moved forward, there's an easy workaround.
`pycld` is fussy when it comes to UTF-8 (see mikemccand/chromium-compact-language-detector#22 and aboSamoor/polyglot#71). This strips out the characters that make `cld` choke. Thanks to @andreoua for the suggested fix.
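Given the Python 3 requirement mentioned above, the workaround is presumably a filter built on `str.isprintable()`, which does not exist in Python 2.7. A minimal sketch under that assumption:

```python
def keep_printable(text):
    """Drop every character Python does not consider printable.
    Note: this also strips newlines and tabs, which cld2 itself handles fine."""
    return "".join(ch for ch in text if ch.isprintable())

keep_printable("A\x96 bad char")  # 'A bad char'
```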
I actually did give hint_language_code, but I'm still receiving the same error.
It's actually only the `Cc`, `Cs`, and `Cn` character categories that cause errors. They can be stripped with the `regex` library:

```python
import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'
```

I brute-forced each unicode character through `Text(char).words` to find the categories that error:

Brute-force script

```python
import sys
import unicodedata
from collections import defaultdict
unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)
all_categories = [
    "Cc",  # Control 65
    "Cf",  # Format 161
    "Co",  # Private Use 0
    "Cs",  # Surrogate 0
    "Ll",  # Lowercase Letter 2,151
    "Lm",  # Modifier Letter 259
    "Lo",  # Other Letter 121,414
    "Lt",  # Titlecase Letter 31
    "Lu",  # Uppercase Letter 1,788
    "Mc",  # Spacing Mark 429
    "Me",  # Enclosing Mark 13
    "Mn",  # Nonspacing Mark 1,826
    "Nd",  # Decimal Number 630
    "Nl",  # Letter Number 236
    "No",  # Other Number 888
    "Pc",  # Connector Punctuation 10
    "Pd",  # Dash Punctuation 24
    "Pe",  # Close Punctuation 73
    "Pf",  # Final Punctuation 10
    "Pi",  # Initial Punctuation 12
    "Po",  # Other Punctuation 588
    "Ps",  # Open Punctuation 75
    "Sc",  # Currency Symbol 62
    "Sk",  # Modifier Symbol 121
    "Sm",  # Math Symbol 948
    "So",  # Other Symbol 6,160
    "Zl",  # Line Separator 1
    "Zp",  # Paragraph Separator 1
    "Zs",  # Space Separator 17
]
from polyglot.text import Text

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except:
            error_cats.add(cat)

# all categories that errored
print(error_cats)
```
Can you let me know how to remove all these chars in a single go? (I have a large text file of 20 GB.)
@lucifer-it you can run `remove_bad_chars` from above over the whole text, and then, if the file is too large for your RAM, you can write out a new file in chunks, removing bad characters one chunk at a time in a while loop, e.g. like https://stackoverflow.com/a/61394102/5511061
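A minimal sketch of that chunked approach, reusing the `remove_bad_chars` function defined above (the file names and chunk size here are placeholder assumptions):

```python
import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

CHUNK_SIZE = 2 ** 20  # about one million characters per chunk

with open("input.txt", encoding="utf-8") as src, \
     open("cleaned.txt", "w", encoding="utf-8") as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:  # EOF
            break
        dst.write(remove_bad_chars(chunk))
```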
I think this is the relevant issue in the pycld2 project. btw @ddelange, your brute-force identification of the specific characters that are the root cause is just beautiful :chefs kiss:
An error occurs when cld2.detect encounters certain UTF-8 byte sequences (aboSamoor/polyglot#71). The error is handled with try-except, and a detected_lang variable was introduced to simplify the conditional branching.
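A minimal sketch of the pattern that comment describes, assuming `cld2.detect` refers to pycld2's detect function and that failures should leave `detected_lang` as None:

```python
import pycld2 as cld2

def detect_language(text):
    """Best-guess language name, or None when cld2 rejects the input."""
    detected_lang = None
    try:
        is_reliable, _, details = cld2.detect(text)
        if is_reliable:
            detected_lang = details[0][0]  # e.g. 'ENGLISH'
    except cld2.error:
        pass  # e.g. "input contains invalid UTF-8 around byte ..."
    return detected_lang
```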
I found that I had to regex sub the "Non-character" codepoints (Unicode category `Cn`) as well.
I adjusted the solution by @ddelange to use: `RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")`
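As a sanity check, the combined pattern also catches noncharacter codepoints such as U+FDD0 (category `Cn`); a small example:

```python
import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

RE_BAD_CHARS.sub("", "A\ufdd0 bad char")  # U+FDD0 is a noncharacter, category Cn
# 'A bad char'
```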
thanks @christopher-siewert, I've updated my solution accordingly |
Running polyglot 16.07.04 on Ubuntu 16.04.