🩹 MD sensitivity adjustments (take two) #76

Merged
merged 3 commits into from Jul 30, 2021

@Ousret Ousret commented Jul 26, 2021

This PR was carefully put together from the latest observations from the community (including some of my own).

  • Exclude ASCII text from the MD plugin ArchaicUpperLower.
  • The is_accentuated function in utils.py was incomplete, so the MD plugins that depend on it produced biased results.
  • Lower the word count threshold for the MD plugin SuperWierdWord.
    • Also improve weird word detection by flagging suspiciously long words.
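
To illustrate the second point, here is a minimal sketch of an accent check built on Unicode character names. It is not the code from utils.py: the function name `looks_accentuated` and both marker lists are assumptions made for this example. It only shows how a marker list that is too short leaves some accented letters undetected, which is the kind of gap that could bias a plugin relying on such a check.

```python
import unicodedata

# Hypothetical marker lists, for illustration only (not the project's actual lists).
NARROW_MARKERS = ("WITH GRAVE", "WITH ACUTE", "WITH CEDILLA")
WIDER_MARKERS = NARROW_MARKERS + (
    "WITH DIAERESIS",
    "WITH CIRCUMFLEX",
    "WITH TILDE",
)


def looks_accentuated(character: str, markers=WIDER_MARKERS) -> bool:
    """Return True if the Unicode name of `character` mentions one of `markers`."""
    try:
        description = unicodedata.name(character)
    except ValueError:
        # Some characters (e.g. control characters) have no Unicode name.
        return False
    return any(marker in description for marker in markers)


if __name__ == "__main__":
    # Compare the narrow and the wider marker list on a few characters.
    for char in "éçñüa":
        print(char, looks_accentuated(char, NARROW_MARKERS), looks_accentuated(char))
```

With the narrow list, a letter such as "ñ" is not reported as accentuated, while the wider list catches it. Any real list would need to be checked against the scripts the detector is expected to handle.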

@Ousret added the labels enhancement (New feature or request) and detection (Related to the charset detection mechanism, chaos/mess/coherence) on Jul 26, 2021

codecov-commenter commented Jul 30, 2021

Codecov Report

Merging #76 (3e1d2a7) into v2.0.4 (30602ce) will decrease coverage by 0.80%.
The diff coverage is 85.13%.

@@            Coverage Diff             @@
##           v2.0.4      #76      +/-   ##
==========================================
- Coverage   85.70%   84.89%   -0.81%     
==========================================
  Files          11       11              
  Lines        1077     1139      +62     
==========================================
+ Hits          923      967      +44     
- Misses        154      172      +18     
| Impacted Files | Coverage Δ |
|---|---|
| charset_normalizer/utils.py | 77.40% <80.00%> (-1.02%) ⬇️ |
| charset_normalizer/md.py | 88.19% <91.17%> (-0.42%) ⬇️ |
| charset_normalizer/api.py | 83.06% <0.00%> (-2.12%) ⬇️ |
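
For reference, the overall percentages in the report are consistent with hits divided by lines. The sketch below uses only the Hits and Lines figures quoted in the coverage diff; the helper name `coverage_percent` is made up for this example, and Codecov's exact rounding is not described here, so the last digit may differ slightly from the report.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines (hypothetical helper)."""
    return 100.0 * hits / lines


# Figures taken from the coverage diff in the report.
before = coverage_percent(923, 1077)  # base branch v2.0.4
after = coverage_percent(967, 1139)   # after merging #76

print(f"before: {before:.2f}%")
print(f"after:  {after:.2f}%")
print(f"delta:  {after - before:+.2f}%")
```

The drop comes from the new lines being covered at a lower rate than the existing code: 62 lines were added but only 44 of them are hit.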

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Ousret commented Jul 30, 2021

+1% without preemptive. Good sign.
This fixes detection for some basic files (mostly JSON and CSV).
Overall, this PR has a positive impact on detection.

@Ousret Ousret marked this pull request as ready for review July 30, 2021 19:26
Ousret commented Jul 30, 2021

Will fix the coverage in another PR.

@Ousret Ousret merged commit 54708dd into v2.0.4 Jul 30, 2021
@Ousret Ousret deleted the patch-md-improvement branch July 30, 2021 20:07
@Ousret Ousret mentioned this pull request Jul 30, 2021
Ousret added a commit that referenced this pull request Jul 30, 2021
* 🔖 Bump version to 2.0.4

* 🩹 MD sensitivity adjustments (#76)

* 🩹 MD sensitivity adjustments 
* 📌 Make sure that the CN dependency pulled in by requests does not shadow the current dev version

* 📝 Do not mislead, don't say if multibyte, priority given (logger, explain)

* 📝 🐛 Tiny mistake when logging detected language using specific cp (debug, explain)

* 🐛 submatch factoring was incorrect in rare cases

* 📝 ⚡ Performance claims update

* 🐛 Multiple files given to the CLI would not result in a JSON array (omitted after the first file)