
chardet.detect a lot slower than using UniversalDetector.feed with chunks #286

Open · AdrianB-sovo opened this issue Mar 5, 2024 · 0 comments
I noticed a very large difference in execution time between calling chardet.detect on a complete byte string and feeding the same data to UniversalDetector.feed in 1 MB chunks.

  • With a 100 MB file, composed only of "tests tests tests tests [....]":

    • chardet.detect takes ~64 seconds.
    • UniversalDetector.feed takes ~3 seconds.
  • With the previous file, after appending a ~10 KB MacRoman-encoded file (containing a character specific to MacRoman):

    • chardet.detect: I interrupted the execution after 20 minutes...
    • UniversalDetector.feed takes ~3 seconds.
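
For reference, a test file like this can be generated along these lines (I'm using "π" here just as an example of a non-ASCII character that exists in MacRoman; any such character triggers the same behavior):

# Build ~100 MB of pure-ASCII text, then append ~10 KB of
# MacRoman-encoded text containing a non-ASCII character.
ascii_part = b"tests " * (100 * 1024 * 1024 // 6)            # ~100 MB of ASCII
macroman_part = ("tests " * 1700 + "π").encode("mac_roman")  # ~10 KB of MacRoman
original_txt = ascii_part + macroman_part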

In case you're wondering what code I used, I compared the execution times of the following:

  • chardet.detect:
from chardet import detect

print(detect(original_txt))
  • UniversalDetector.feed:
from chardet.universaldetector import UniversalDetector

CHUNK_SIZE = 1024 * 1024  # 1 MB chunks, as described above

detector = UniversalDetector()
# Feed the data chunk by chunk instead of all at once.
for start in range(0, len(original_txt), CHUNK_SIZE):
    detector.feed(original_txt[start:start + CHUNK_SIZE])
detector.close()
print(detector.result)
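
The timings above come from wrapping each variant in a simple time.perf_counter measurement, along these lines:

import time

t0 = time.perf_counter()
print(detect(original_txt))
print(f"chardet.detect took {time.perf_counter() - t0:.1f} s")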