Skip to content

Releases: Ousret/charset_normalizer

Version 3.3.2

31 Oct 20:22
79dce48
Compare
Choose a tag to compare

3.3.2 (2023-10-31)

Fixed

  • Unintentional memory usage regression when using large payloads that match several encodings (#376)
  • Regression on some detection cases showcased in the documentation (#371)

Added

  • Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)

Version 3.3.1

22 Oct 15:07
5208644
Compare
Choose a tag to compare

3.3.1 (2023-10-22)

Changed

  • Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
  • Improved the general detection reliability based on reports from the community

Release 3.3.0

30 Sep 06:08
165211a
Compare
Choose a tag to compare

3.3.0 (2023-09-30)

Added

  • Allow to execute the CLI (e.g. normalizer) through python -m charset_normalizer.cli or python -m charset_normalizer
  • Support for 9 forgotten encodings that are supported by Python but unlisted in encoding.aliases as they have no alias (#323)

Removed

  • (internal) Redundant utils.is_ascii function and unused function is_private_use_only
  • (internal) charset_normalizer.assets is moved inside charset_normalizer.constant

Changed

  • (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
  • Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8

Fixed

  • Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in __lt__ (#350)

Version 3.2.0

07 Jul 18:02
0424c80
Compare
Choose a tag to compare

3.2.0 (2023-06-07)

Changed

  • Typehint for function from_path no longer enforce PathLike as its first argument
  • Minor improvement over the global detection reliability

Added

  • Introduce function is_binary that relies on main capabilities, and is optimized to detect binaries
  • Propagate enable_fallback argument throughout from_bytes, from_path, and from_fp that allow a deeper control over the detection (default True)
  • Explicit support for Python 3.12

Fixed

  • Edge case detection failure where a file would contain 'very-long' camel-cased word (Issue #289)

Version 3.1.0

06 Mar 06:48
db9af43
Compare
Choose a tag to compare

3.1.0 (2023-03-06)

Added

  • Argument should_rename_legacy for legacy function detect and disregard any new arguments without errors (PR #262)

Removed

  • Support for Python 3.6 (PR #260)

Changed

  • Optional speedup provided by mypy/c 1.0.1

Version 3.0.1

18 Nov 05:45
5dd7aa0
Compare
Choose a tag to compare

3.0.1 (2022-11-18)

Fixed

  • Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)

Changed

  • Speedup provided using mypy/c 0.990 on Python >= 3.7

Version 3.0.0

20 Oct 09:00
0ec52ef
Compare
Choose a tag to compare

3.0.0 (2022-10-20)

Added

  • Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
  • Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
  • Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
  • normalizer --version now specify if the current version provides extra speedup (meaning mypyc compilation whl)

Changed

  • Build with static metadata (not pyproject.toml yet)
  • Make language detection stricter
  • Optional: Module md.py can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1

Fixed

  • CLI with opt --normalize fail when using full path for files
  • TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha characters have been fed to it
  • Sphinx warnings when generating the documentation

Removed

  • Coherence detector no longer returns 'Simple English' instead returns 'English'
  • Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'
  • Breaking: Method first() and best() from CharsetMatch
  • UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)
  • Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
  • Breaking: Top-level function normalize
  • Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch
  • Support for the backport unicodedata2

This is the last version (3.0.x) to support Python 3.6 We plan to drop it for 3.1.x

Version 3.0.0rc1

18 Oct 19:18
544595d
Compare
Choose a tag to compare
Version 3.0.0rc1 Pre-release
Pre-release

This is the last pre-release. If everything goes well, I will publish the stable tag.

3.0.0rc1 (2022-10-18)

Added

  • Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
  • Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
  • Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio

Changed

  • Build with static metadata using 'build' frontend
  • Make language detection stricter

Fixed

  • CLI with opt --normalize fail when using full path for files
  • TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha characters have been fed to it

Removed

  • Coherence detector no longer returns 'Simple English' instead returns 'English'
  • Coherence detector no longer returns 'Classical Chinese' instead returns 'Chinese'

Version 3.0.0b2

21 Aug 19:07
Compare
Choose a tag to compare
Version 3.0.0b2 Pre-release
Pre-release

3.0.0b2 (2022-08-21)

Added

  • normalizer --version now specify if current version provide extra speedup (meaning mypyc compilation whl)

Removed

  • Breaking: Method first() and best() from CharsetMatch
  • UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)

Fixed

  • Sphinx warnings when generating the documentation

Version 2.1.1

19 Aug 21:56
86cda88
Compare
Choose a tag to compare

2.1.1 (2022-08-19)

Deprecated

  • Function normalize scheduled for removal in 3.0

Changed

  • Removed useless call to decode in fn is_unprintable (#206)

Fixed

  • Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from @aleksandernovikov (#204)