New language models added; old inacurate models was rebuilded. Hungarian test files changed. Script for language model building added #52

ghost · 2015-02-15T14:54:54Z

Text in the hungarian language can't contain many english words inside detected text. For example xml files can have more english words because of tag names and others. This detector is based on the letter frequency.
The second problem arises if the hungarian text has many sentences in uppercase.

coveralls · 2015-02-15T14:55:44Z

Changes Unknown when pulling e1d6e68 on helour:master into * on chardet:master*.

dan-blanchard · 2015-02-15T16:16:01Z

chardet/langgreekmodel.py

-# 254: Carriage/Return
-# 253: symbol (punctuation) that does not belong to word
-# 252: 0 - 9
+Windows_1253_Greek_char_to_order_map = (


Why have the character-to-order maps changed? It seems to me that you would only need to update the language model itself.

It is due to automatic script 'create_language_model.py' for langmodel building/creation, which I've created. The name of the charmap tuple is automatically taken from the 'CharsetsTabs.txt' table.
I am too lazy to keep/rewrite all charmap names (for all language models) to the "old names".
Please look at the my new 'sbcsgroupprober.py'.
Btw: It is only a name like name of a variable. I think it can be any regular name :)

The content of the charmap table (values in the tuple) was changed because I was created lang model from other "training" text. Simple said my source text has other letters frequency (probability) table.

coveralls · 2015-02-15T18:56:15Z

Changes Unknown when pulling 3560b0e on helour:master into * on chardet:master*.

coveralls · 2015-02-15T18:56:16Z

Changes Unknown when pulling 3560b0e on helour:master into * on chardet:master*.

coveralls · 2015-02-15T19:34:59Z

Changes Unknown when pulling 8262b0f on helour:master into * on chardet:master*.

dan-blanchard · 2015-02-16T13:49:17Z

tests/iso-8859-2-hungarian/auto-apro.hu.xml

@@ -1,474 +0,0 @@
-<?xml version="1.0" encoding="iso-8859-2" ?>


Please don't convert the XML test files to plaintext. Part of chardet attempts to remove tags automatically, so these are important test cases.

Dou you have better solution?
Why plain text gives good detection and XML (hungarian, greek) wrong?
I carefully chose raw text (>10MB, not ancient) for the hungarian and greek language model building, but chardetect gives wrong charset name for some xml files. Something is wrong because for example greek alphabet doesn't contains latin letters.

I've found in the latin1prober.py:
def filter_with_english_letters(buf):
This filter can be applied to all scripts which contain both English characters and extended ASCII characters, but is currently only used by ``Latin1Prober``.

Here is the part of the "plain" hungarian text "honositomuhely.hu.xml' with automatically tag removed in the latin1prober:
title Honosító Műhely Legfrissebb title link http www honositomuhely hu link description description language language copyright Copyright C Herczeg József copyright pubDate Wed Jan pubDate item title Simply Calenders d fr title link http www honositomuhely hu index php option com remository Itemid func fileinfo filecatid parent link pubDate Wed Jan pubDate description description category category item item title PDF Download fw ÚJ title link http www honositomuhely hu index php option com remository Itemid func fileinfo filecatid parent link pubDate Wed Jan pubDate description description category category item item title VideoInspector fw fr title link http www honositomuhely hu index php option com remository Itemid func fileinfo filecatid parent.

The tag filtering is completely wrong and is implemented in the latin1prober only.
There are also problem with

My conclusion: "Some XML files can't serves for the charset detector quality testing, because it gives wrong impression that some language models are bad".

I completely agree that the tag filtering is terribly broken. You see, I only recently took over maintaining chardet about a year ago, and there are many things I haven't had the time to fix yet. We had a discussion recently about how bizarre the tag filtering is in #46.

Anyway, I think the real solution here would be to change the API such that there is an option chardet.detect() called remove_tags that would perform that tag filtering as a preprocessing step. We want something that works for plaintext as well as HTML/XML files.

Today I've made very simple filter for tags, urls, digits (!yes digits too!), and empty lines (or spaces at the beginning of the lines) in the universaldetector.py:

XML_ESC_MAP = (('>', b'>'), ('<', b'<'), ('&', b'&'), (''', b"'"), ('"', b'"'), (' ', b' ')) def remove_tags(self, txt): for esc, char in self.XML_ESC_MAP: txt = txt.replace(esc, char) txt = txt.replace('<![CDATA[', '') txt = txt.replace(']]>', '') txt = re.sub(br'', '', txt, flags=re.DOTALL) txt = re.sub(br'<\?*\/*[A-Z]+[^>]*>', '', txt, flags=re.IGNORECASE) return txt def remove_urls(self, txt): txt = re.sub(br'\b(?:(?:https?|ftp|file)://|www\.|ftp\.)' '(?:$[-A-Z0-9+&@#/%=~_|$?!:;,.]*$|[-A-Z0-9+&@#\/%=~_|$?!:;,.])*' '(?:$[-A-Z0-9+&@#/%=~_|$?!:;,.]*$|[A-Z0-9+&@#/%=~_|$])', '', txt, flags=re.IGNORECASE) return txt def remove_digits(self, txt): txt = re.sub(br'[0-9]+', '', txt) return txt def remove_empty_lines(self, txt): txt = re.sub(br'^(\s|\t)*\n', '', txt, flags=re.MULTILINE) return txt

Before filtering:
Ran 413 tests in 97.956s FAILED (failures=26)
After filtering:
Ran 413 tests in 57.866s FAILED (failures=16)

dan-blanchard · 2015-02-16T14:10:45Z

Thanks for putting this together.

I get the impression you're fairly new to Python, as there are a bunch of little style issues, but I can take care of those. I also think I'll probably change your charset table file to be a JSON file in the end. I'll try to only comment on the substantive stuff.

dan-blanchard · 2015-02-16T14:14:11Z

chardet/sbcharsetprober.py

@@ -36,7 +36,7 @@ class SingleByteCharSetProber(CharSetProber):
    SB_ENOUGH_REL_THRESHOLD = 1024
    POSITIVE_SHORTCUT_THRESHOLD = 0.95
    NEGATIVE_SHORTCUT_THRESHOLD = 0.05
-    SYMBOL_CAT_ORDER = 250
+    SYMBOL_CAT_ORDER = 254


This was wrong before, but I think it should actually be 253, since the old comment said:

# 255: Control characters that usually does not exist in any text # 254: Carriage/Return # 253: symbol (punctuation) that does not belong to word # 252: 0 - 9

I need it to distinguish between iso-8859-2 (latin2) and cp1250 (windows-1250) for some central european languages. It is hard to explain in one sentence, but I can try:
I need to take account (in the sbcharsetprober.py: self._total_char += 1) some CE letters (like šžťśź) which are letters in the cp1250 but are symbols in the latin2. If my solution is not good, I can try to (manually) decrease value in charmap tables of the CE language models from 253 to 252. Exactly said: The first and the second solution are not perfect, but without this I can't distinguish (big thanks M$) between latin2 and cp1250.
Due to adding -1 it can be 253

coveralls · 2015-02-16T15:34:35Z

Changes Unknown when pulling c74022c on helour:master into * on chardet:master*.

coveralls · 2015-02-16T15:40:49Z

Changes Unknown when pulling ed5dce9 on helour:master into * on chardet:master*.

coveralls · 2015-02-16T16:20:54Z

Changes Unknown when pulling fa49042 on helour:master into * on chardet:master*.

coveralls · 2015-02-16T17:00:48Z

Changes Unknown when pulling 527c49e on helour:master into * on chardet:master*.

coveralls · 2015-02-16T17:00:48Z

Changes Unknown when pulling 527c49e on helour:master into * on chardet:master*.

coveralls · 2015-02-18T12:58:20Z