
Q&A: truncated binary as input to detection #269

Open

Guillermogsjc opened this issue Feb 22, 2023 · 4 comments

Labels: enhancement (New feature or request)

Comments


Guillermogsjc commented Feb 22, 2023

Hi, congrats on a great library and a very nice improvement in both accuracy and performance over classic chardet.

This is neither a bug report nor a feature request; it is more of a "usage question".

Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform inference. This avoids unnecessary I/O on file sniffing before the real file load. As an example:

import charset_normalizer

# Read only the first N bytes for sniffing.
with open(file_path, 'rb') as raw_data:
    bin_data = raw_data.read(n_bytes_to_sniff_encoding)

best_detection_result = charset_normalizer.from_bytes(bin_data).best()
encoding = best_detection_result.encoding

The question is simple, though I did not manage to find any reference regarding it:

What happens if charset_normalizer.from_bytes is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character?
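
For illustration, here is how such a truncated buffer arises (standard library only; nothing here is specific to charset_normalizer):

# 100 ASCII characters followed by "é", which is 2 bytes in UTF-8.
payload = ("x" * 100 + "é").encode("utf-8")

# Cutting the last byte leaves the buffer ending mid-code-point.
truncated = payload[:-1]
truncated.decode("utf-8")  # raises UnicodeDecodeError ("unexpected end of data")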

It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).

Also, is there a threshold on the number of bytes beyond which accuracy does not really improve consistently, and which could be taken as "optimal" for this encoding detection?

Guillermogsjc added the enhancement (New feature or request) label on Feb 22, 2023
Ousret (Owner) commented Feb 24, 2023

Hello,

Glad to hear it is satisfactory.

Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform inference. This avoids unnecessary I/O on file sniffing before the real file load.

Good assertion.

What happens if charset_normalizer.from_bytes is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character?

Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without leaving the end user to handle a UnicodeDecodeError.

Short answer: it will say NOT UTF-8, or nothing at all.
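
A quick sketch to reproduce the scenario (the expected outcome follows the answer above; exact results may vary by version):

from charset_normalizer import from_bytes

# UTF-8 content whose final multi-byte character is cut in half.
truncated = ("x" * 100 + "é").encode("utf-8")[:-1]

result = from_bytes(truncated).best()
# As stated above: either no match at all (None) or a guess other than utf_8.
print(result.encoding if result else None)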

We do not run the detection using all the bytes; the main algorithm runs on smaller chunks, so the performance concern should be minimal.
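
For reference, the chunk sampling is tunable through keyword arguments on from_bytes; a sketch using the documented parameters (defaults may differ across versions):

from charset_normalizer import from_bytes

results = from_bytes(
    payload,
    steps=5,         # how many chunks to sample across the payload
    chunk_size=512,  # size of each sampled chunk, in bytes
    threshold=0.2,   # maximum acceptable "mess" ratio for a match
)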

It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).

Yes, there is room for improvement in the following case: an incomplete byte sequence (truncated at the end).

Also, is there a threshold on the number of bytes beyond which accuracy does not really improve consistently, and which could be taken as "optimal" for this encoding detection?

For now, I recommend passing the whole content, to avoid a broken byte sequence, and keeping the default kwargs in from_bytes(...).

Do this instead:

from charset_normalizer import from_path

guesses = from_path(file_path)

if guesses:
    best_detection_result = guesses.best()
    encoding = best_detection_result.encoding

    payload = best_detection_result.raw   # the original, untouched bytes
    string = str(best_detection_result)   # the decoded text

Hope that answers your questions.

Guillermogsjc (Author)

Thank you very much for the complete answer.

Ousret added a commit that referenced this issue Mar 6, 2023

Mr0grog commented Oct 12, 2023

Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without leaving the end user to handle a UnicodeDecodeError.

For what it’s worth, one way to handle this (inside the library) might be to do all the speculative decoding like this:

from typing import Union

def decode_bytes(sequences: Union[bytes, bytearray], encoding: str, complete: bool):
    try:
        return str(sequences, encoding=encoding)
    except UnicodeDecodeError as error:
        # If the error is in the final code point and the buffer is not known
        # to be complete, it might just be truncated mid-code-point.
        # Note: you could also check `"incomplete" in error.reason` to really know if
        # this is about an incomplete code point, but I think that might be likely to
        # break in other Python runtimes or future Python versions.
        if not complete and len(sequences) - error.start < max_bytes_per_point(encoding):
            return str(sequences[:error.start], encoding=encoding)
        # Otherwise the bytes are genuinely undecodable in this encoding.
        return None

Or, a little fancier, integrated into Python's codec error-handling system, and probably more performant:

import codecs

def ignore_incomplete_final_code_point(error):
    # Same notes as above about optionally checking `error.reason`.
    if (
        isinstance(error, UnicodeDecodeError)
        and len(error.object) - error.start < max_bytes_per_point(error.encoding)
    ):
        return ('', error.end)

    raise error

codecs.register_error('ignore_incomplete_final_code_point', ignore_incomplete_final_code_point)

# Now to decode a buffer that might end in the middle of a code point:
str(sequences, encoding=encoding, errors='ignore_incomplete_final_code_point')

Both of those assume you have a function called max_bytes_per_point() that gets the largest possible number of bytes per code point in a given encoding (e.g. max_bytes_per_point('big5') == 2, max_bytes_per_point('utf-8') == 4), but you could also replace that function call with 4 (it’d be slightly less accurate, but probably good enough).
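
For completeness, a minimal sketch of that assumed helper (max_bytes_per_point is hypothetical, not part of charset_normalizer; the table covers only a handful of common codecs):

import codecs

# Widest possible encoded code point, in bytes, keyed by Python's
# canonical codec names.
_MAX_BYTES_PER_POINT = {
    "ascii": 1,
    "iso8859-1": 1,   # a.k.a. latin-1
    "cp1252": 1,
    "big5": 2,
    "shift_jis": 2,
    "euc_jp": 3,
    "utf-8": 4,
    "gb18030": 4,
    "utf-16": 4,
    "utf-32": 4,
}

def max_bytes_per_point(encoding: str) -> int:
    # Normalize aliases (e.g. "UTF8" -> "utf-8") via the codecs registry,
    # then fall back to 4, the widest value among common codecs.
    return _MAX_BYTES_PER_POINT.get(codecs.lookup(encoding).name, 4)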

Ousret (Owner) commented Oct 19, 2023

Yes, you've got part of the thinking right, but unfortunately it will require a lot more work.
We are working on a solution, but it takes time. It's halfway there.
