
Commit

❇️ Improvement over Vietnamese detection (#126)
Latin characters with combining diacritical marks were misdetected: they were seen as mess (MD).
Ousret committed Oct 23, 2021
1 parent 8b52c35 commit 5c72742
Showing 1 changed file with 6 additions and 0 deletions.
6 changes: 6 additions & 0 deletions charset_normalizer/md.py
@@ -453,6 +453,12 @@ def is_suspiciously_successive_range(
    if "Emoticons" in unicode_range_a or "Emoticons" in unicode_range_b:
        return False

    # Latin characters can be accompanied with a combining diacritical mark
    # eg. Vietnamese.
    if "Latin" in unicode_range_a or "Latin" in unicode_range_b:
        if "Combining" in unicode_range_a or "Combining" in unicode_range_b:
            return False

    keywords_range_a, keywords_range_b = unicode_range_a.split(
        " "
    ), unicode_range_b.split(" ")
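
To illustrate why the new check matters, here is a minimal standalone sketch. It is not the library's code: `latin_with_combining` is a hypothetical helper that only mirrors the condition added in this hunk, and the range names passed to it ("Basic Latin", "Combining Diacritical Marks", "Cyrillic") are assumed to be the kind of Unicode block names the real function receives. The second part uses the standard `unicodedata` module to show that decomposed (NFD) Vietnamese text interleaves Latin letters with combining marks, which is the pattern that was previously flagged.

```python
import unicodedata


def latin_with_combining(unicode_range_a: str, unicode_range_b: str) -> bool:
    """Hypothetical helper mirroring the added check.

    Returns True when one of the two range names mentions "Latin" and one
    mentions "Combining". In the commit, this is the case where
    is_suspiciously_successive_range returns False (not suspicious).
    """
    ranges = (unicode_range_a, unicode_range_b)
    has_latin = any("Latin" in name for name in ranges)
    has_combining = any("Combining" in name for name in ranges)
    return has_latin and has_combining


# Decomposed Vietnamese text: each accented letter becomes a base Latin
# letter followed by one or more combining marks.
decomposed = unicodedata.normalize("NFD", "Việt")

# unicodedata.combining() returns a non-zero class for combining marks.
combining_marks = [ch for ch in decomposed if unicodedata.combining(ch)]

if __name__ == "__main__":
    print(latin_with_combining("Basic Latin", "Combining Diacritical Marks"))
    print(latin_with_combining("Basic Latin", "Cyrillic"))
    print([unicodedata.name(ch) for ch in combining_marks])
```

Like the committed condition, the helper relies on substring matching over range names, so it covers every range whose name contains "Latin" or "Combining" without listing them individually.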
