Q&A: truncated binary as input to detection #269
Comments
Hello, glad to hear it is satisfactory. Good assertion.
Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without forcing the end user to handle a decoding error. Short answer: it will say NOT UTF-8, or nothing. We do not run the detection using all bytes; the main algorithm runs on smaller chunks, so the performance concern should be minimal.
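For illustration, a minimal sketch of that scenario (my own example, assuming a recent charset_normalizer):

```python
from charset_normalizer import from_bytes

# "café" encodes to 5 bytes in UTF-8; dropping the last byte splits the 'é'.
truncated = "café".encode("utf-8")[:-1]

guesses = from_bytes(truncated)
# The decoder pass fails on the broken tail, so UTF-8 is ruled out;
# best() may fall back to another guess or return None ("nothing").
print(guesses.best())
```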
Yes, there is a possibility of improvement in the following case: an incomplete byte sequence (truncated at the end).
For now, I recommend passing the whole content to avoid a broken byte sequence, and keeping the default kwargs of `from_bytes(...)`. Do this instead:

```python
from charset_normalizer import from_path

guesses = from_path(file_path)

if guesses:
    best_detection_result = guesses.best()

    encoding = best_detection_result.encoding
    payload = best_detection_result.raw
    string = str(best_detection_result)
```

Hope that answers your questions.
Thank you very much for the complete answer.
For what it’s worth, one way to handle this (inside the library) might be to do all the speculative decoding like this:

```python
from typing import Union

def decode_bytes(sequences: Union[bytes, bytearray], encoding: str, complete: bool):
    try:
        return str(sequences, encoding=encoding)
    except UnicodeDecodeError as error:
        # If the error is in the final code point, it might be incomplete.
        # Note: you could also check `"incomplete" in error.reason` to really know if
        # this is about an incomplete code point, but I think that might be likely to
        # break in other Python runtimes or future Python versions.
        if not complete and len(sequences) - error.start < max_bytes_per_point(encoding):
            return str(sequences[:error.start], encoding=encoding)
        raise
```

Or, a little more fancy, integrated into Python’s decoding system, and probably more performant:

```python
import codecs

def ignore_incomplete_final_code_point(error):
    # Same notes as above about optionally checking `error.reason`.
    if (
        isinstance(error, UnicodeDecodeError)
        and len(error.object) - error.start < max_bytes_per_point(error.encoding)
    ):
        return ('', error.end)
    raise error

codecs.register_error('ignore_incomplete_final_code_point', ignore_incomplete_final_code_point)

# Now to decode a buffer that might end in the middle of a code point:
str(sequences, encoding=encoding, errors='ignore_incomplete_final_code_point')
```

Both of those assume you have a function called `max_bytes_per_point(encoding)`, returning the maximum number of bytes a single code point can occupy in the given encoding.
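Since both sketches lean on that helper, here is a minimal way it could look (purely illustrative; the table only covers the common Unicode codecs and assumes anything else is single-byte):

```python
import codecs

# Upper bound on bytes per code point, keyed by Python's canonical codec name.
_MAX_BYTES = {
    "utf-8": 4,
    "utf-16": 4, "utf-16-le": 4, "utf-16-be": 4,
    "utf-32": 4, "utf-32-le": 4, "utf-32-be": 4,
}

def max_bytes_per_point(encoding: str) -> int:
    canonical = codecs.lookup(encoding).name  # normalize aliases, e.g. "UTF8" -> "utf-8"
    return _MAX_BYTES.get(canonical, 1)  # fall back to single-byte codecs
```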
Yes, you've got part of the thinking right, but unfortunately it will require a lot more work.
Hi, congrats on a great library and a very nice improvement in both accuracy and performance over the classic chardet.
This is neither a bug nor a feature request; it is more of a "usage question".
Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform inference on them. This is convenient to avoid unnecessary I/O during sniffing, before the real file load. As an example:
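A minimal sketch of that pattern (`file_path` and `N` are placeholders):

```python
from charset_normalizer import from_bytes

N = 4096  # arbitrary sniff size

with open(file_path, "rb") as fp:
    head = fp.read(N)  # only the first N bytes are read from disk

# `head` may end in the middle of a multi-byte character.
guesses = from_bytes(head)
```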
The question is simple, and I did not manage to find any reference with respect to it:

What happens if `charset_normalizer.from_bytes` is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character? It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).
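A hypothetical caller-side sketch of that idea (UTF-8-specific, and deliberately rough: it may also drop one complete trailing character, which is harmless for sniffing):

```python
def trim_incomplete_utf8_tail(head: bytes) -> bytes:
    end = len(head)
    # Strip trailing UTF-8 continuation bytes (0b10xxxxxx).
    while end > 0 and (head[end - 1] & 0b1100_0000) == 0b1000_0000:
        end -= 1
    # Strip the lead byte of a multi-byte sequence (0b11xxxxxx), if any.
    if end > 0 and head[end - 1] >= 0b1100_0000:
        end -= 1
    return head[:end]
```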
Also, is there a threshold on the number of bytes used for inference beyond which accuracy does not consistently improve, and which could therefore be treated as "optimal" for encoding detection?