Releases: Ousret/charset_normalizer
Version 3.3.2
3.3.2 (2023-10-31)
Fixed
- Unintentional memory usage regression when using large payloads that match several encodings (#376)
- Regression on some detection cases showcased in the documentation (#371)
Added
- Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife, thanks!)
Version 3.3.1
3.3.1 (2023-10-22)
Changed
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
- Improved the general detection reliability based on reports from the community
Release 3.3.0
3.3.0 (2023-09-30)
Added
- Allow executing the CLI (e.g. normalizer) through python -m charset_normalizer.cli or python -m charset_normalizer
- Support for 9 forgotten encodings that are supported by Python but unlisted in encodings.aliases as they have no alias (#323)
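To illustrate what "supported by Python but unlisted in the alias table" can mean, the sketch below lists codec modules shipped in the standard library's encodings package that no entry of encodings.aliases points to. It uses only the standard library and is not the code charset_normalizer itself runs; the helper name codecs_without_alias is hypothetical, and because the result also includes non-text codecs it will not necessarily match the 9 encodings mentioned above.

```python
import encodings
import pkgutil
from encodings.aliases import aliases


def codecs_without_alias():
    """Return codec module names that no alias in encodings.aliases targets.

    Hypothetical helper for illustration only.
    """
    aliased_targets = set(aliases.values())
    names = []
    for module_info in pkgutil.iter_modules(encodings.__path__):
        name = module_info.name
        # 'aliases' is the alias table itself, not a codec module.
        if name == "aliases":
            continue
        if name not in aliased_targets:
            names.append(name)
    return sorted(names)


unlisted = codecs_without_alias()
print(unlisted)
```

The exact list depends on the Python version and platform, so no expected output is shown.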
Removed
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant
Changed
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8
Fixed
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in __lt__ (#350)
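The kind of ordering this fix concerns can be sketched as follows: candidates are ranked by chaos (lower is better), and when two chaos values are close, coherence (higher is better) breaks the tie. This is a simplified illustration, not the library's actual __lt__; the Candidate class, the 0.01 threshold, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a detection result (not CharsetMatch)."""

    encoding: str
    chaos: float      # mess ratio, lower is better
    coherence: float  # language coherence, higher is better

    def __lt__(self, other):
        # Assumed threshold: chaos values this close are treated as a tie.
        if abs(self.chaos - other.chaos) < 0.01:
            # Tie on chaos: the more coherent candidate sorts first.
            return self.coherence > other.coherence
        return self.chaos < other.chaos


candidates = [
    Candidate("cp1252", chaos=0.020, coherence=0.40),
    Candidate("utf_8", chaos=0.022, coherence=0.90),
    Candidate("cp037", chaos=0.300, coherence=0.10),
]

ranked = sorted(candidates)
print([c.encoding for c in ranked])  # ['utf_8', 'cp1252', 'cp037']
```

A threshold-based comparison like this is not guaranteed to be transitive, which is one reason close-call branches in such a method deserve careful testing.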
Version 3.2.0
3.2.0 (2023-06-07)
Changed
- Typehint for function from_path no longer enforces PathLike as its first argument
- Minor improvement over the global detection reliability
Added
- Introduce function is_binary that relies on main capabilities, and is optimized to detect binaries
- Propagate enable_fallback argument throughout from_bytes, from_path, and from_fp that allows a deeper control over the detection (default True)
- Explicit support for Python 3.12
Fixed
- Edge case detection failure where a file would contain a 'very-long' camel-cased word (Issue #289)
Version 3.1.0
Version 3.0.1
Version 3.0.0
3.0.0 (2022-10-20)
Added
- Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
- normalizer --version now specifies if the current version provides extra speedup (meaning mypyc compilation whl)
Changed
- Build with static metadata (not pyproject.toml yet)
- Make language detection stricter
- Optional: Module md.py can be compiled using Mypyc to provide an extra speedup, up to 4x faster than v2.1
Fixed
- CLI with opt --normalize fails when using full path for files
- TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
- Sphinx warnings when generating the documentation
Removed
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
- Breaking: Methods first() and best() from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function normalize
- Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch
- Support for the backport unicodedata2
This is the last version (3.0.x) to support Python 3.6. We plan to drop it for 3.1.x.
Version 3.0.0rc1
This is the last pre-release. If everything goes well, I will publish the stable tag.
3.0.0rc1 (2022-10-18)
Added
- Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
Changed
- Build with static metadata using 'build' frontend
- Make language detection stricter
Fixed
- CLI with opt --normalize fails when using full path for files
- TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
Removed
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
Version 3.0.0b2
3.0.0b2 (2022-08-21)
Added
- normalizer --version now specifies if the current version provides extra speedup (meaning mypyc compilation whl)
Removed
- Breaking: Methods first() and best() from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflicts with ASCII)
Fixed
- Sphinx warnings when generating the documentation
Version 2.1.1
2.1.1 (2022-08-19)
Deprecated
- Function normalize scheduled for removal in 3.0
Changed
- Removed useless call to decode in fn is_unprintable (#206)
Fixed
- Third-party library (i18n xgettext) crashing because it did not recognize utf_8 (PEP 263) with an underscore, from @aleksandernovikov (#204)