
Q&A: truncated binary as input to detection #269

Open

Guillermogsjc opened this issue Feb 22, 2023 · 4 comments

Labels: enhancement (New feature or request)

Comments


Guillermogsjc commented Feb 22, 2023

Hi, congrats on a great library and a very nice improvement in both accuracy and performance over classic chardet.

This is neither a bug report nor a feature request; it is more of a "usage question".

Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform inference. This avoids unnecessary I/O on file sniffing before the real file load. As an example:

import charset_normalizer

# Read only the first N bytes for sniffing.
with open(file_path, 'rb') as raw_data:
    bin_data = raw_data.read(n_bytes_to_sniff_encoding)

best_detection_result = charset_normalizer.from_bytes(bin_data).best()
encoding = best_detection_result.encoding

The question is simple, though I did not manage to find any reference regarding it:

What happens if charset_normalizer.from_bytes is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character?
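
For illustration, here is how such a truncated buffer arises (standard library only; nothing here is specific to charset_normalizer):

# 100 ASCII characters followed by "é", which is 2 bytes in UTF-8.
payload = ("x" * 100 + "é").encode("utf-8")

# Cutting the last byte leaves the buffer ending mid-code-point.
truncated = payload[:-1]
truncated.decode("utf-8")  # raises UnicodeDecodeError ("unexpected end of data")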

It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).

Also, is there a threshold on the number of bytes beyond which accuracy does not really improve consistently, and which could be taken as "optimal" for this encoding detection?

Guillermogsjc added the enhancement (New feature or request) label on Feb 22, 2023
Ousret (Owner) commented Feb 24, 2023

Hello,

Glad to hear it is satisfactory.

Usually, to detect an encoding, it is desirable to sniff only the first N bytes of a file and then perform inference. This avoids unnecessary I/O on file sniffing before the real file load.

Good assertion.

What happens if charset_normalizer.from_bytes is given a byte sequence that is truncated in such a way that the last bytes do not represent a valid UTF-8 (or other encoding) character?

Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without leaving the end user to handle a UnicodeDecodeError.

Short answer: it will say NOT UTF-8, or nothing at all.
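
A quick sketch to reproduce the scenario (the expected outcome follows the answer above; exact results may vary by version):

from charset_normalizer import from_bytes

# UTF-8 content whose final multi-byte character is cut in half.
truncated = ("x" * 100 + "é").encode("utf-8")[:-1]

result = from_bytes(truncated).best()
# As stated above: either no match at all (None) or a guess other than utf_8.
print(result.encoding if result else None)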

We do not run the detection using all the bytes; the main algorithm runs on smaller chunks, so the performance concern should be minimal.
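
For reference, the chunk sampling is tunable through keyword arguments on from_bytes; a sketch using the documented parameters (defaults may differ across versions):

from charset_normalizer import from_bytes

results = from_bytes(
    payload,
    steps=5,         # how many chunks to sample across the payload
    chunk_size=512,  # size of each sampled chunk, in bytes
    threshold=0.2,   # maximum acceptable "mess" ratio for a match
)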

It would be very nice to ignore the last K bytes during inference (or to expose this as a configurable parameter).

Yes, there is room for improvement in the following case: an incomplete byte sequence (truncated at the end).

Also, is there a threshold on the number of bytes beyond which accuracy does not really improve consistently, and which could be taken as "optimal" for this encoding detection?

For now, I recommend passing the whole content, to avoid a broken byte sequence, and keeping the default kwargs in from_bytes(...).

Do this instead:

from charset_normalizer import from_path

guesses = from_path(file_path)

if guesses:
    best_detection_result = guesses.best()
    encoding = best_detection_result.encoding

    payload = best_detection_result.raw   # the original, untouched bytes
    string = str(best_detection_result)   # the decoded text

Hope that answers your questions.

Guillermogsjc (Author)

Thank you very much for the complete answer.

Ousret added a commit that referenced this issue Mar 6, 2023

Mr0grog commented Oct 12, 2023

Running the detection on a broken byte sequence is not supported as of today, mainly because we rely on the decoders to assess whether we can reasonably return a guess without leaving the end user to handle a UnicodeDecodeError.

For what it’s worth, one way to handle this (inside the library) might be to do all the speculative decoding like this:

from typing import Union

def decode_bytes(sequences: Union[bytes, bytearray], encoding: str, complete: bool):
    try:
        return str(sequences, encoding=encoding)
    except UnicodeDecodeError as error:
        # If the error is in the final code point and the buffer is not known
        # to be complete, it might just be truncated mid-code-point.
        # Note: you could also check `"incomplete" in error.reason` to really know if
        # this is about an incomplete code point, but I think that might be likely to
        # break in other Python runtimes or future Python versions.
        if not complete and len(sequences) - error.start < max_bytes_per_point(encoding):
            return str(sequences[:error.start], encoding=encoding)
        # Otherwise the bytes are genuinely undecodable in this encoding.
        return None

Or, a little fancier, integrated into Python's codec error-handling system, and probably more performant:

import codecs

def ignore_incomplete_final_code_point(error):
    # Same notes as above about optionally checking `error.reason`.
    if (
        isinstance(error, UnicodeDecodeError)
        and len(error.object) - error.start < max_bytes_per_point(error.encoding)
    ):
        return ('', error.end)

    raise error

codecs.register_error('ignore_incomplete_final_code_point', ignore_incomplete_final_code_point)

# Now to decode a buffer that might end in the middle of a code point:
str(sequences, encoding=encoding, errors='ignore_incomplete_final_code_point')

Both of those assume you have a function called max_bytes_per_point() that gets the largest possible number of bytes per code point in a given encoding (e.g. max_bytes_per_point('big5') == 2, max_bytes_per_point('utf-8') == 4), but you could also replace that function call with 4 (it’d be slightly less accurate, but probably good enough).
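
For completeness, a minimal sketch of that assumed helper (max_bytes_per_point is hypothetical, not part of charset_normalizer; the table covers only a handful of common codecs):

import codecs

# Widest possible encoded code point, in bytes, keyed by Python's
# canonical codec names.
_MAX_BYTES_PER_POINT = {
    "ascii": 1,
    "iso8859-1": 1,   # a.k.a. latin-1
    "cp1252": 1,
    "big5": 2,
    "shift_jis": 2,
    "euc_jp": 3,
    "utf-8": 4,
    "gb18030": 4,
    "utf-16": 4,
    "utf-32": 4,
}

def max_bytes_per_point(encoding: str) -> int:
    # Normalize aliases (e.g. "UTF8" -> "utf-8") via the codecs registry,
    # then fall back to 4, the widest value among common codecs.
    return _MAX_BYTES_PER_POINT.get(codecs.lookup(encoding).name, 4)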

Ousret (Owner) commented Oct 19, 2023

Yes, you've got part of the thinking right, but unfortunately it will require a lot more work.
We are working on a solution, but it takes time. It's halfway there.
