
New language models added; old inaccurate models rebuilt. Hungarian test files changed. Script for language model building added #52

Closed
wants to merge 65 commits into from

Conversation

@ghost ghost commented Feb 15, 2015

Hungarian text can't be detected reliably when it contains many English words. For example, XML files can contain a lot of English because of tag names and the like, and this detector is based on letter frequency.
The second problem arises when the Hungarian text has many sentences in uppercase.
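Since a frequency-based detector scores text by how well its letter distribution matches a language model, admixed English words dilute the frequencies the Hungarian model expects. A toy illustration of that effect (the sample strings are mine, not chardet code):

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each alphabetic character in ``text``."""
    letters = [ch.lower() for ch in text if ch.isalpha()]
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).items()}

hungarian = "az előző fejezetben említett szöveg"
mixed = hungarian + " encoding detection language model xml tag"

# English admixture shifts weight onto plain ASCII letters, so the
# accented characters the Hungarian model relies on score lower.
print(letter_frequencies(hungarian)["ő"])  # ~0.065
print(letter_frequencies(mixed)["ő"])      # ~0.030
```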

@coveralls

Coverage Status

Changes Unknown when pulling e1d6e68 on helour:master into chardet:master.

# 254: Carriage/Return
# 253: symbol (punctuation) that does not belong to word
# 252: 0 - 9
Windows_1253_Greek_char_to_order_map = (
Member

Why have the character-to-order maps changed? It seems to me that you would only need to update the language model itself.

Author

It is due to the automatic script 'create_language_model.py' for language model building, which I've created. The name of the charmap tuple is taken automatically from the 'CharsetsTabs.txt' table.
I am too lazy to keep/rewrite all the charmap names (for all language models) to the "old names".
Please look at my new 'sbcsgroupprober.py'.
Btw: it is only a name, like the name of a variable. I think it can be any regular name :)

The content of the charmap table (the values in the tuple) changed because I created the language model from different "training" text. Simply said, my source text has a different letter-frequency (probability) table.

@coveralls

Coverage Status

Changes Unknown when pulling 3560b0e on helour:master into chardet:master.


@ghost ghost changed the title New language models added; old inacurate model was rebuilded. Hungarian test files changed. Script for language model building added New language models added; old inacurate models was rebuilded. Hungarian test files changed. Script for language model building added Feb 15, 2015
@coveralls

Coverage Status

Changes Unknown when pulling 8262b0f on helour:master into chardet:master.

@@ -1,474 +0,0 @@
<?xml version="1.0" encoding="iso-8859-2" ?>
Member

Please don't convert the XML test files to plaintext. Part of chardet attempts to remove tags automatically, so these are important test cases.

Author

Do you have a better solution?
Why does plain text give good detection while XML (Hungarian, Greek) gives wrong results?
I carefully chose raw text (>10MB, not ancient) for the Hungarian and Greek language model building, but chardetect gives the wrong charset name for some XML files. Something is wrong, because, for example, the Greek alphabet doesn't contain Latin letters.

I've found this in latin1prober.py:

def filter_with_english_letters(buf):

    This filter can be applied to all scripts which contain both English
    characters and extended ASCII characters, but is currently only used by
    ``Latin1Prober``.

Author

Here is part of the "plain" Hungarian text 'honositomuhely.hu.xml' after tags were automatically removed in the latin1prober:

title Honosító Műhely Legfrissebb title link http www honositomuhely hu link description description language language copyright Copyright C Herczeg József copyright pubDate Wed Jan pubDate item title Simply Calenders d fr title link http www honositomuhely hu index php option com remository Itemid func fileinfo filecatid parent link pubDate Wed Jan pubDate description description category category item item title PDF Download fw ÚJ title link http www honositomuhely hu index php option com remository Itemid func fileinfo filecatid parent link pubDate Wed Jan pubDate description description category category item item title VideoInspector fw fr title link http www honositomuhely hu index php option com remository Itemid func fileinfo filecatid parent.

The tag filtering is completely wrong, and it is implemented in the latin1prober only.
There are also problems with

My conclusion: "Some XML files can't serve for charset-detector quality testing, because they give the wrong impression that some language models are bad."

Member

I completely agree that the tag filtering is terribly broken. You see, I only took over maintaining chardet about a year ago, and there are many things I haven't had time to fix yet. We recently had a discussion in #46 about how bizarre the tag filtering is.

Anyway, I think the real solution here would be to change the API so that chardet.detect() takes an option called remove_tags that performs the tag filtering as a preprocessing step. We want something that works for plaintext as well as HTML/XML files.
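A remove_tags keyword does not exist in chardet's API today; purely as a hypothetical sketch of that preprocessing idea (the function name and regex are illustrative, and real markup needs more care: comments, CDATA, attributes containing '>'):

```python
import re

def strip_markup(data):
    """Hypothetical preprocessing step: replace SGML-style tags in a
    byte buffer with spaces so only character data reaches the probers."""
    return re.sub(br"<[^>]*>", b" ", data)

sample = b"<item><title>PDF Download</title></item>"
print(strip_markup(sample))  # b'  PDF Download  '
```

With such a flag, chardet.detect() would simply run the probers on strip_markup(data) instead of the raw buffer.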

Author

Today I've made a very simple filter for tags, URLs, digits (yes, digits too!), and empty lines (or spaces at the beginning of lines) in universaldetector.py:

XML_ESC_MAP = ((b'&gt;', b'>'), (b'&lt;', b'<'), (b'&amp;', b'&'),
               (b'&apos;', b"'"), (b'&quot;', b'"'), (b'&nbsp;', b' '))

def remove_tags(self, txt):
    # Unescape XML entities so the probers see the real characters.
    for esc, char in self.XML_ESC_MAP:
        txt = txt.replace(esc, char)

    txt = txt.replace(b'<![CDATA[', b'')
    txt = txt.replace(b']]>', b'')

    # Strip comments, then processing instructions and tags.
    txt = re.sub(br'<!--.*?-->', b'', txt, flags=re.DOTALL)
    txt = re.sub(br'<\?*/*[A-Z]+[^>]*>', b'', txt, flags=re.IGNORECASE)
    return txt

def remove_urls(self, txt):
    txt = re.sub(br'\b(?:(?:https?|ftp|file)://|www\.|ftp\.)'
                 br'(?:\([-A-Z0-9+&@#/%=~_|$?!:;,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:;,.])*'
                 br'(?:\([-A-Z0-9+&@#/%=~_|$?!:;,.]*\)|[A-Z0-9+&@#/%=~_|$])',
                 b'', txt, flags=re.IGNORECASE)
    return txt

def remove_digits(self, txt):
    return re.sub(br'[0-9]+', b'', txt)

def remove_empty_lines(self, txt):
    # Drop lines that contain only whitespace.
    return re.sub(br'^[ \t]*\n', b'', txt, flags=re.MULTILINE)

Before filtering:

Ran 413 tests in 97.956s
FAILED (failures=26)

After filtering:

Ran 413 tests in 57.866s
FAILED (failures=16)

@dan-blanchard
Member

Thanks for putting this together.

I get the impression you're fairly new to Python, as there are a bunch of little style issues, but I can take care of those. I also think I'll probably change your charset table file to be a JSON file in the end. I'll try to only comment on the substantive stuff.

@@ -36,7 +36,7 @@ class SingleByteCharSetProber(CharSetProber):
SB_ENOUGH_REL_THRESHOLD = 1024
POSITIVE_SHORTCUT_THRESHOLD = 0.95
NEGATIVE_SHORTCUT_THRESHOLD = 0.05
SYMBOL_CAT_ORDER = 250
SYMBOL_CAT_ORDER = 254
Member

This was wrong before, but I think it should actually be 253, since the old comment said:

# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
# 253: symbol (punctuation) that does not belong to word
# 252: 0 - 9
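For illustration only (this is not chardet's code; the real char-to-order maps bake these reserved values into 256-entry tables per encoding, alongside frequency ranks for letters), the categories above can be sketched as a tiny classifier for ASCII bytes:

```python
CONTROL, CRLF, SYMBOL, DIGIT = 255, 254, 253, 252

def categorize(byte):
    """Bucket an ASCII byte into the reserved order categories."""
    if byte in (0x0D, 0x0A):      # carriage return / line feed
        return CRLF
    if byte < 0x20:               # other control characters
        return CONTROL
    if 0x30 <= byte <= 0x39:      # '0'-'9'
        return DIGIT
    if not chr(byte).isalpha():   # punctuation and other symbols
        return SYMBOL
    return None  # a letter: the table stores its frequency rank instead

print(categorize(ord("7")), categorize(ord(",")), categorize(0x0D))  # 252 253 254
```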

Author

I need it to distinguish between iso-8859-2 (latin2) and cp1250 (windows-1250) for some Central European languages. It is hard to explain in one sentence, but I can try:
I need to count (in sbcharsetprober.py: self._total_char += 1) some CE letters (like šžťśź) which are letters in cp1250 but symbols in latin2. If my solution is not good, I can try to (manually) decrease the value in the charmap tables of the CE language models from 253 to 252. To be exact: neither the first nor the second solution is perfect, but without this I can't distinguish (big thanks, M$) between latin2 and cp1250.

Due to adding -1, it can be 253.

@coveralls

Coverage Status

Changes Unknown when pulling c74022c on helour:master into chardet:master.

@coveralls

Coverage Status

Changes Unknown when pulling ed5dce9 on helour:master into chardet:master.

@coveralls

Coverage Status

Changes Unknown when pulling fa49042 on helour:master into chardet:master.

@coveralls

Coverage Status

Changes Unknown when pulling 527c49e on helour:master into chardet:master.


@coveralls

Coverage Status

Changes Unknown when pulling 5b01a38 on helour:master into chardet:master.



out = b''
Member

bytes objects are not mutable in Python, so switching this to use bytes and then concatenating to it with += means creating lots of temporary objects. BytesIO should be faster (although feel free to prove me wrong).

Author

You are wrong, please try this:

import time

# 256 one-byte chunks to append in the loops. Note: bytearray([i]) is a
# single byte with value i; bytearray(i) would create i zero bytes.
c = []
for i in range(0, 256):
    c.append(bytearray([i]))

start = time.time()
from io import BytesIO
filtered = BytesIO()
for i in range(0, 10000000):
    filtered.write(c[i % 255])
ret = filtered.getvalue()
print(time.time() - start)

start = time.time()
s = b''
for i in range(0, 10000000):
    s += c[i % 255]
ret = s
print(time.time() - start)

Member

Umm... I just ran this and it completely confirmed my suspicions.

The BytesIO part (with Python 3) finished in 3.5 seconds, and the part using bytes with concatenation was running for over 5 minutes before I killed it.

Author

Your suspicion is right for Python 3 but wrong for Python 2.7.
It is not a good idea to create one project for various Python versions. There are many problems with compatibility and speed optimization.

Member

I agree supporting both Python versions is difficult, but I'm not quite willing to leave Python 2 users completely in the dust yet, since there are so many of them. Especially when the hard work for maintaining compatibility has mostly been done already.

That said, I'll definitely target Python 3 for optimizations.

Author

Maybe this new piece of code (Python 2 and 3 compatible) is what you need:

import time

# 256 one-byte chunks to append in the loops (bytearray([i]) is a
# single byte with value i).
c = []
for i in range(0, 256):
    c.append(bytearray([i]))

start = time.time()
from io import BytesIO
filtered = BytesIO()
for i in range(0, 10000000):
    filtered.write(c[i % 255])
ret = filtered.getvalue()
print(time.time() - start)

start = time.time()
s = bytearray()
for i in range(0, 10000000):
    s.extend(c[i % 255])
ret = s
print(time.time() - start)

BTW, the second part is still the winner because it doesn't use a stream :D

Member

Nice! Thanks for the suggestion. 👍

@sigmavirus24
Member

Hey @helour, there are a lot of changes here, including ones by @dan-blanchard that are ostensibly re-committed from master. Further, these changes can't be merged in their current state. You've also left review comments where there should be code comments. Please properly rebase this pull request and consider squashing these commits into a handful of better-focused commits (especially the ones that all share the same overly long subject line).

Further, please add your review comments as code comments with better explanations than you left here. GitHub will likely not survive many more years, and all the data you left here will be useless in some future git-based hosting solution. The code comments will be 100x more valuable.

Finally, please address @dan-blanchard's code review comments. In its current state, this pull request will be difficult to review and would be a good candidate for closing without prejudice so that it can be split into much smaller change sets that are easier to review.

@dan-blanchard
Member

Well, it looks like I'm going to need to spend some time pulling in the good bits from this myself. We scared him off. 😦 I got an email saying:

It seems that github and Ian's rules aren't right for me.
I deleted my account from the github and I want to offer my latest work to you.
All yours today's suggestions are implemented.

@sigmavirus24
Member

Were my requests that unreasonable?

@dan-blanchard
Member

I didn't think so, but he was an old-school C programmer working with Python and GitHub for the first time, so I think his tolerance for feedback was low. There also seemed to be a slight language/cultural barrier that might have made him read things you said more harshly than you intended. It's hard enough to pick up on tone things when you're reading your native language!

@dan-blanchard
Member

This has been replaced by #99.
