
Error on language detection for some unicode characters (control characters) #71

alexgarel opened this issue Aug 29, 2016 · 16 comments

Comments


alexgarel commented Aug 29, 2016

>>> from polyglot.text import Sentence
>>> Sentence("try \x96 it").words
...
/usr/local/lib/python3.5/dist-packages/polyglot/detect/base.py in detect(self, text)
---> 84     reliable, index, top_3_choices = cld2.detect(t, bestEffort=False)

error: input contains invalid UTF-8 around byte 4 (of 9)

Running polyglot 16.07.04 on Ubuntu 16.04.


alexgarel commented Aug 29, 2016

To find them all:

>>> bads = set()
>>> for i in range(10000):
...     try:
...         Sentence("try %s it" % chr(i)).words
...     except:
...         bads.add(i)
>>> ", ".join(chr(i) for i in sorted(list(bads)))
'\x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x0b, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15, \x16, \x17, \x18, \x19, \x1a, \x1b, \x1c, \x1d, \x1e, \x1f, \x7f, \x80, \x81, \x82, \x83, \x84, \x85, \x86, \x87, \x88, \x89, \x8a, \x8b, \x8c, \x8d, \x8e, \x8f, \x90, \x91, \x92, \x93, \x94, \x95, \x96, \x97, \x98, \x99, \x9a, \x9b, \x9c, \x9d, \x9e, \x9f'

My proposal would be either to fix cld2 (if that is possible) or simply to remove those characters from the sentence before submitting it for detection.


tindzk commented Dec 14, 2016

To bypass cld2, you can also instantiate Text with the hint_language_code parameter.
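
For example, a minimal sketch of that workaround (the sample string comes from the original report; note that a later comment in this thread reports the same error can still occur even with a hint):

from polyglot.text import Text

# Supplying the language up front avoids relying on cld2's detection.
text = Text("try \x96 it", hint_language_code="en")
print(text.words)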

@motazsaad

Hello,

How do I catch this error?

error: input contains invalid UTF-8 around byte ...

Which exception should I catch?

Thanks
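
A minimal sketch of one way to handle it, assuming pycld2 exposes its exception as pycld2.error (which matches the "error:" prefix in the traceback above); a broad except Exception would also work:

import pycld2
from polyglot.text import Sentence

try:
    words = Sentence("try \x96 it").words
except pycld2.error as exc:  # assumption: pycld2's module-level exception class
    print("language detection failed:", exc)
    words = []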


mamoit commented Nov 16, 2017

Correct me if I'm wrong, but this still seems to be a problem.

@alexgarel (Author)

For the moment on my side, I simply filter out bad characters before submission…
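
The filter itself isn't shown; a minimal sketch matching the code-point list above (strip_bad_chars and BAD_CODEPOINTS are hypothetical names) could look like this:

# Code points from the brute-force list above:
# \x00-\x08, \x0b, \x0e-\x1f, \x7f-\x9f
BAD_CODEPOINTS = (
    list(range(0x00, 0x09)) + [0x0B] + list(range(0x0E, 0x20)) + list(range(0x7F, 0xA0))
)
BAD_CHARS_TABLE = dict.fromkeys(BAD_CODEPOINTS)  # map each code point to None

def strip_bad_chars(text):
    # str.translate drops characters mapped to None, in a single pass.
    return text.translate(BAD_CHARS_TABLE)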

@jamesdbaker

I believe this is the underlying issue in cld2: mikemccand/chromium-compact-language-detector#22

@jamesdbaker

I've used the following command to remove control characters from my dataset, using the list of characters provided by @alexgarel above.

sed 's/[\00\01\02\03\04\05\06\07\08\0b\0e\0f\10\11\12\13\14\15\16\17\18\19\1a\1b\1c\1d\1e\1f\7f\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f]//' input.txt > output.txt

Posting it here in case it's useful for anyone else hitting this problem, but I'm not convinced that the list of characters above is complete as I still have issues on some files.


vldbnc commented Oct 23, 2018

@jamesdbaker you might want to add the 'g' flag for multiple substitutions:

's/[\00\01\02\03\04\05\06\07\08\0b\0e\0f\10\11\12\13\14\15\16\17\18\19\1a\1b\1c\1d\1e\1f\7f\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f]//g' input.txt > output.txt


sjlongland commented Dec 7, 2018

@andreoua provided a nice succinct workaround to this pycld issue (see @jamesdbaker's link) which works in Python 3.6…

printable_str = ''.join(x for x in html_str if x.isprintable())

This won't work for Python 2.7 users, but for those of us who have moved forward, there's an easy workaround.

sjlongland added a commit to sjlongland/hackaday.io-spambot-hunter that referenced this issue Dec 7, 2018
`pycld` is fussy when it comes to UTF-8 (see
mikemccand/chromium-compact-language-detector#22
and aboSamoor/polyglot#71).  This strips out
the characters that make `cld` choke.

Thanks to @andreoua for the suggested fix.
@zafercavdar

> To bypass cld2, you can also instantiate Text with the hint_language_code parameter.

I actually did pass hint_language_code, but I'm still receiving the same error.

KieranLitschel added a commit to KieranLitschel/Contextualised-Image-Classifiers that referenced this issue Feb 23, 2020

ddelange commented Oct 13, 2020

It's actually only the Cc, Cs and Cn Unicode categories that throw this error, as far as I can tell. Using regex to remove them, as suggested here, should do the trick.

import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'

I brute-forced each unicode character through polyglot on py38:

Brute-force script
import sys
import unicodedata
from collections import defaultdict

unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)

all_categories = [
    "Cc",  # Control 65
    "Cf",  # Format  161
    "Co",  # Private Use 0
    "Cs",  # Surrrogate  0
    "Ll",  # Lowercase Letter    2,151
    "Lm",  # Modifier Letter 259
    "Lo",  # Other Letter    121,414
    "Lt",  # Titlecase Letter    31
    "Lu",  # Uppercase Letter    1,788
    "Mc",  # Spacing Mark    429
    "Me",  # Enclosing Mark  13
    "Mn",  # Nonspacing Mark 1,826
    "Nd",  # Decimal Number  630
    "Nl",  # Letter Number   236
    "No",  # Other Number    888
    "Pc",  # Connector Punctuation   10
    "Pd",  # Dash Punctuation    24
    "Pe",  # Close Punctuation   73
    "Pf",  # Final Punctuation   10
    "Pi",  # Initial Punctuation 12
    "Po",  # Other Punctuation   588
    "Ps",  # Open Punctuation    75
    "Sc",  # Currency Symbol 62
    "Sk",  # Modifier Symbol 121
    "Sm",  # Math Symbol 948
    "So",  # Other Symbol    6,160
    "Zl",  # Line Separator  1
    "Zp",  # Paragraph Separator 1
    "Zs",  # Space Separator 17
]

from polyglot.text import Text

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except Exception:
            error_cats.add(cat)

# all categories that errored
print(error_cats)


ayush-8 commented Jan 2, 2021

> To find them all: […]
>
> My proposal would be either to fix cld2 (if that is possible) or simply to remove those characters from the sentence before submitting it for detection.

Can you let me know how to remove all these chars in a single go? (I have a large text file of 20 GB.)


ddelange commented Jan 2, 2021

@lucifer-it

pip install regex

and then use remove_bad_chars from the snippet above (it does a one-pass replacement).

If the file is too large for your RAM, you can write out a new file in chunks, removing bad characters one chunk at a time in a while loop, e.g. like https://stackoverflow.com/a/61394102/5511061 (see the sketch below).
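
A minimal sketch of that chunked approach (file names and chunk size are placeholders; the helper is repeated from the snippet above so the sketch is self-contained):

import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

CHUNK_CHARS = 1 << 20  # 1M characters per chunk

with open("input.txt", encoding="utf-8", errors="replace") as src, \
     open("output.txt", "w", encoding="utf-8") as dst:
    while True:
        chunk = src.read(CHUNK_CHARS)  # text mode: reads whole characters
        if not chunk:
            break
        dst.write(remove_bad_chars(chunk))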


ned2 commented Mar 28, 2022

I think the cld2 links in this thread may be pointing to the wrong project. Polyglot depends on pycld2, rather than that older port of the cld2 library. Since the former is a fork of the latter, I'd hazard a guess that the bug exists in both projects.

This is the relevant issue in the pycld2 project.

btw @ddelange your brute-force identification of the specific characters that are the root cause is just beautiful :chefs kiss:

Aodaruma added a commit to akatsuki/distalk-voicevox that referenced this issue Apr 19, 2022
cld2.detect raises an error when it encounters certain byte sequences in UTF-8 (aboSamoor/polyglot#71).
Handled the error with try-except, and introduced a detected_lang variable to simplify the conditional branching.
@christopher-siewert

I found that I also had to regex-sub away the "non-character" Cn Unicode category to avoid pycld2 errors.

import sys
import unicodedata

import pycld2

error_categories = set()
for c in map(chr, range(sys.maxunicode + 1)):
    try:
        pycld2.detect(c, returnVectors=True)
    except Exception:
        error_categories.add(unicodedata.category(c))
print(error_categories)
# {'Cs', 'Cn', 'Cc'}

I adjusted the solution by @ddelange to use:

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")


ddelange commented Nov 1, 2024

Thanks @christopher-siewert, I've updated my solution accordingly.
