Release 4.0.0 #140

Merged
merged 34 commits into from Dec 10, 2020

Conversation

dan-blanchard
Member

@dan-blanchard dan-blanchard commented Oct 11, 2017

The optimizations in master really could use a release, even though my model retraining work has stalled out lately (because I've been too busy with work and life).

This will have to be a major version change since the model format has changed entirely (even though 99% of users never mess with that).

dan-blanchard and others added 3 commits June 8, 2017 10:32
…models (#121)

* Convert single byte charset modules to use dicts of dicts for language modules

- Also provide conversion script

* Fix debug logging check

* Keep Hungarian commented out until we retrain
* Add API option to get all the encodings confidence #96

* make code more straightforward

by treating the self.done = True as a real finish point of the analysis

* use detect_all instead of detect(.., all=True)

* fix corner case of when there is no good prober
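
For readers who never touch the models, the format change in the first commit can be pictured with a toy example. This is only a sketch: `SAMPLE_SIZE`, `flat_model`, `nested_model`, and the values in them are made up for illustration and are not chardet's actual tables. It contrasts a flat sequence indexed by arithmetic with a dict of dicts indexed by the previous and current character order.

```python
# Toy illustration of two model layouts; all values are invented.
SAMPLE_SIZE = 3  # hypothetical; real models cover many more character orders

# Flat layout: one tuple, indexed by (previous_order * SAMPLE_SIZE) + order.
flat_model = (
    0, 1, 2,
    1, 3, 0,
    2, 0, 3,
)

# Nested layout: dict of dicts, indexed as model[previous_order][order].
# Building it from the flat tuple is roughly what a conversion step would do.
nested_model = {
    prev: {cur: flat_model[prev * SAMPLE_SIZE + cur] for cur in range(SAMPLE_SIZE)}
    for prev in range(SAMPLE_SIZE)
}


def flat_lookup(prev_order, order):
    """Look up a sequence category using index arithmetic."""
    return flat_model[prev_order * SAMPLE_SIZE + order]


def nested_lookup(prev_order, order):
    """Look up the same category using two dict accesses."""
    return nested_model[prev_order][order]
```

Both lookups return the same category for every pair; only the storage layout differs.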
@dan-blanchard dan-blanchard changed the title Release 3.1.0 Release 4.0.0 Oct 11, 2017
Member

@sigmavirus24 sigmavirus24 left a comment

I only had time to review this visually, but it all looks fine to me.

@dan-blanchard
Member Author

Ugh. I put together a little benchmark script to show the performance improvements that switching to dicts would have made. Unfortunately, it showed exactly the opposite. Turns out the microbenchmarks I was running to justify the change didn't hold up when everything was fully integrated.

Before:

Benchmarking chardet 3.0.4 on CPython 3.6.3 | packaged by conda-forge | (default, Oct  5 2017, 19:18:17)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 31184.416356877322
big5: 5.070345385292817
cp932: 3.2069605829343444
cp949: 2.083250883504551
euc-jp: 3.320824171408068
euc-kr: 4.498561131545299
euc-tw: 48.99756198139536
gb2312: 4.55051526320596
ibm855: 18.0173338760511
ibm866: 27.157370866963888
iso-2022-jp: 2384.34654084475
iso-2022-kr: 10246.253817026995
iso-8859-1: 89.04007574438816
iso-8859-5: 23.843149290107355
iso-8859-7: 45.85586100711808
koi8-r: 19.71956986432471
maccyrillic: 22.258476294246357
shift_jis: 3.498450722935723
tis-620: 10.728576291002154
utf-16: 64527.75384615385
utf-32: 118316.05077574048
utf-8: 16.722890545863986
utf-8-sig: 97997.75700934579
windows-1251: 23.389860495067502
windows-1252: 147.1726925668089
windows-1255: 9.893936751289512

Total time: 459.3649871349335s (7.815136330678761 calls per second)

After:

Benchmarking chardet 4.0.0 on CPython 3.6.3 | packaged by conda-forge | (default, Oct  5 2017, 19:18:17)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 2271.8771820750844
big5: -0.18189224524472447
cp932: 0.046362371132514735
cp949: -0.19279474147731213
euc-jp: -0.17010130600427997
euc-kr: 0.0074659949420379235
euc-tw: 12.422974186586565
gb2312: 0.009971666683429525
ibm855: 0.8959192996468062
ibm866: -2.029760099831517
iso-2022-jp: -5.679269251714231
iso-2022-kr: 8226.787363329757
iso-8859-1: -0.5427543139328606
iso-8859-5: -0.7698640100054561
iso-8859-7: -1.893714346652203
koi8-r: -0.26980212358143163
maccyrillic: 0.5434692105018506
shift_jis: -0.06122787025867593
tis-620: -0.46658667115513275
utf-16: 56870.335878882324
utf-32: -44211.73982167688
utf-8: -0.5287044522123061
utf-8-sig: -6518.934762889956
windows-1251: -0.16414366325980012
windows-1252: -23.830037157210995
windows-1255: -0.2958933200025804

Total time: 469.64176321029663s (7.644124269230431 calls per second)

Significant Differences (ignoring everything where difference is less than 1 call per second):

Calls per second deltas (positive = good, negative = bad):

ascii: 2271.8771820750844
euc-tw: 12.422974186586565
ibm866: -2.029760099831517
iso-2022-jp: -5.679269251714231
iso-2022-kr: 8226.787363329757
iso-8859-7: -1.893714346652203
utf-16: 56870.335878882324
utf-32: -44211.73982167688
utf-8-sig: -6518.934762889956
windows-1252: -23.830037157210995

I don't quite know what to do with these results. It looks like we're much faster at detecting ASCII, ISO-2022-KR, and UTF-16 now, but much worse at detecting UTF-32 and UTF-8-SIG. The most confusing part is that I didn't actually change any of the ASCII detection code, and that happens before we even use the SBCS probers I actually modified. The ones I actually modified appear to have been mostly a wash.

Also, if you want to see what a true speed up actually is, try running it with cChardet:

Benchmarking cchardet 2.1.1 on CPython 3.6.3 | packaged by conda-forge | (default, Oct  5 2017, 19:18:17)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
--------------------------------------------------------------------------------
Calls per second for each encoding:
ascii: 265462.2784810127
big5: 2479.9403278330633
cp932: 2300.0972471026944
cp949: 3912.597014925373
euc-jp: 7759.846378605286
euc-kr: 2617.9834437344443
euc-tw: 2398.5269057013784
gb2312: 2580.697242287388
ibm855: 321.94328917658646
ibm866: 442.73316839260195
iso-2022-jp: 158875.15151515152
iso-2022-kr: 268865.641025641
iso-8859-1: 7852.786220239024
iso-8859-5: 520.6036793643053
iso-8859-7: 3327.2546981968762
koi8-r: 220.32986674266112
maccyrillic: 442.66703419435385
shift_jis: 5229.5663955513255
tis-620: 320.0622679735819
utf-16: 603496.9784172662
utf-32: 612307.1532846715
utf-8: 27442.071625344353
utf-8-sig: 453438.2702702703
windows-1251: 468.51856225429214
windows-1252: 28493.91304347826
windows-1255: 377.21009418033856

Total time: 4.861895561218262s (738.3951289773163 calls per second)
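
The benchmark script itself is not shown in this thread. As a rough sketch of how per-encoding "calls per second" figures and the "significant differences" list could be computed, something like the following would work. The function names, the `min_seconds` parameter, and `fake_detect` are all assumptions for illustration; `fake_detect` is a stand-in, not a real detector.

```python
import time


def calls_per_second(func, data, min_seconds=0.05):
    """Call func(data) repeatedly for at least min_seconds; return calls/sec."""
    calls = 0
    start = time.perf_counter()
    while True:
        func(data)
        calls += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds and elapsed > 0:
            break
    return calls / elapsed


def significant_deltas(before, after, threshold=1.0):
    """Return after - before per encoding, dropping deltas under threshold."""
    deltas = {name: after[name] - before[name] for name in before if name in after}
    return {name: d for name, d in deltas.items() if abs(d) >= threshold}


def fake_detect(data):
    # Stand-in for a real detector call; it only checks for 7-bit input.
    return {"encoding": "ascii" if data.isascii() else None}


if __name__ == "__main__":
    rate = calls_per_second(fake_detect, b"hello world" * 100)
    print(f"ascii: {rate} calls per second")
```

To compare two library versions, one would run `calls_per_second` per encoding under each version and pass the two result dicts to `significant_deltas`.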

jdufresne and others added 6 commits October 21, 2017 11:23
The wheel package format supports including the license file. This is
done using the [metadata] section in the setup.cfg file. For additional
information on this feature, see:

https://wheel.readthedocs.io/en/stable/index.html#including-the-license-in-the-generated-wheel-file
Include license file in the generated wheel package
Helps pip decide what version of the library to install.

https://packaging.python.org/tutorials/distributing-packages/#python-requires

> If your project only runs on certain Python versions, setting the
> python_requires argument to the appropriate PEP 440 version specifier
> string will prevent pip from installing the project on other Python
> versions.

https://setuptools.readthedocs.io/en/latest/setuptools.html#new-and-changed-setup-keywords

> python_requires
>
> A string corresponding to a version specifier (as defined in PEP 440)
> for the Python version, used to specify the Requires-Python defined in
> PEP 345.
@elcolumbio

elcolumbio commented Jun 2, 2018

OK, now I understand. If only a few rows contain non-ASCII characters, you can do something like the second function below, which then gives you a massive speedup.

On the first pass, remove all ASCII characters?

from chardet.universaldetector import UniversalDetector

def guess_encoding():
    # fp_r.original.content is the whole document as bytes
    u = fp_r.original.content
    detector = UniversalDetector()
    detector.feed(u)
    detector.close()
    return detector.result

%t guess_encoding()

4.71 s ± 63.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

def guess_encoding2():
    bytelist = fp_r.original.content.splitlines()
    guess = []
    detector = UniversalDetector()
    for line in bytelist:
        # detect each line on its own; reset() clears state from the previous line
        detector.reset()
        detector.feed(line)
        detector.close()
        guess.append(detector.result)
    return guess

%t guess_encoding2()

39 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
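
One reading of the "remove all ASCII" question above is to drop pure-ASCII lines before detection, so the detector only sees lines that carry high bytes. The sketch below assumes that reading; `non_ascii_lines` and `sample` are hypothetical names, and the detector call itself is left out so the snippet needs nothing beyond the standard library (Python 3.7+ for `bytes.isascii`).

```python
def non_ascii_lines(data):
    """Return only the lines of data that contain at least one byte >= 0x80."""
    return [line for line in data.splitlines() if not line.isascii()]


sample = b"plain header\ncaf\xc3\xa9 au lait\nanother plain line\nna\xc3\xafve\n"

# Rejoin the surviving lines into one smaller buffer. This buffer is what
# one would then pass to a detector, for example via UniversalDetector.feed.
reduced = b"\n".join(non_ascii_lines(sample))
```

A filter like this would not suit 7-bit encodings such as ISO-2022-JP, which signal themselves with escape sequences rather than high bytes, and results may differ from running detection on the whole document.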

hrnciar and others added 8 commits December 8, 2020 15:17
When packaging chardet and pip (which bundles it) in Fedora, we have realized
that there is a non-executable file with a shebang line.

It seems that the primary purpose of this file is to be imported from Python
code or to be executed via python chardet/cli/chardetect.py or
python -m chardet.cli.chardetect and hence the shebang appears to be unnecessary.

Shebangs are hard to handle when doing downstream packaging, because it makes
sense for upstream to use #!/usr/bin/env python while in the RPM package, we
need to avoid that and use a more specific interpreter. Since the shebang was
unused, I propose to remove it to avoid the problems.
Since setuptools v41.5.0 (27 Oct 2019), the 'test' command is formally
deprecated and should not be used.

The pytest-runner package also lists itself as deprecated:
https://github.com/pytest-dev/pytest-runner

> Deprecation Notice
>
> pytest-runner depends on deprecated features of setuptools and relies
> on features that break security mechanisms in pip. For example
> 'setup_requires' and 'tests_require' bypass pip --require-hashes. See
> also pypa/setuptools#1684.
The CLI entry point is installed by setuptools through the
console_scripts option. This setuptools feature automatically constructs
a file with a shebang and sets the executable bit. The imported file
chardet.cli.chardetect does not need this bit itself.
Throughout the rest of the chardet code we assume that FOUND_IT means we
can stop looking. Previously the CharsetGroupProber did not set its
state appropriately when a child prober returned FOUND_IT. This
substantially speeds up chardet for most encodings.

Fixes #202
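
To make the idea in that last commit message concrete, here is a simplified sketch of a group prober that records FOUND_IT on itself when a child reports it, so the caller can stop feeding data. The classes `FixedProber` and `GroupProber` and the `detect` helper are hypothetical stand-ins written for this illustration; they are not chardet's actual implementation.

```python
from enum import Enum


class ProbingState(Enum):
    DETECTING = 0
    FOUND_IT = 1
    NOT_ME = 2


class FixedProber:
    """Hypothetical child prober that always reports the same state."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.feed_calls = 0

    def feed(self, data):
        self.feed_calls += 1
        return self.state


class GroupProber:
    """Simplified stand-in for a group prober; not chardet's actual class."""

    def __init__(self, probers):
        self.probers = probers
        self.state = ProbingState.DETECTING
        self.best_guess = None

    def feed(self, data):
        for prober in self.probers:
            if prober.feed(data) == ProbingState.FOUND_IT:
                self.best_guess = prober
                # Record the result on the group itself so callers that
                # check the group's state know they can stop feeding data.
                self.state = ProbingState.FOUND_IT
                return self.state
        return self.state


def detect(group, chunks):
    """Feed chunks until the group reports FOUND_IT; return chunks consumed."""
    consumed = 0
    for chunk in chunks:
        consumed += 1
        if group.feed(chunk) == ProbingState.FOUND_IT:
            break
    return consumed
```

Without the assignment to `self.state`, a caller that checks only the group's state would keep feeding every remaining chunk even though a child had already given a definitive answer.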
@dan-blanchard
Member Author

dan-blanchard commented Dec 10, 2020

Now that we've got an actual performance improvement in #203, I'm going to push this absurdly long-lived release out. Retrained models will come in next release.


10 participants