Convert single-byte charset probers to use nested dicts for language models #121

Merged
dan-blanchard merged 3 commits into master from feature/new_style_sbcs_models on Apr 27, 2017

Conversation

dan-blanchard
Member

This isolates one of the major changes in #99, which is changing our single-byte charset prober language model format to use nested dicts instead of giant lists and offset math. This makes the code much easier to understand and language model access takes about 60% of the time it used to.

# Current
In [40]: %timeit [lang_model_tuple[(i * 64) + j % 64] for i, j in random_index_tuples]
10000 loops, best of 3: 168 µs per loop

# Same as current but language model is a list instead of a tuple
In [41]: %timeit [lang_model_list[(i * 64) + j % 64] for i, j in random_index_tuples]
10000 loops, best of 3: 170 µs per loop

# Single dictionary, but keys are tuples
In [42]: %timeit [lang_model_tup_dict[(i, j)] for i, j in random_index_tuples]
10000 loops, best of 3: 140 µs per loop

# Nested dictionary like in this PR
In [43]: %timeit [lang_model_nested_dict[i][j] for i, j in random_index_tuples]
10000 loops, best of 3: 99.4 µs per loop
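
For anyone skimming, here is roughly what the two access patterns being timed look like. This is a minimal, self-contained sketch with made-up scores and a made-up NUM_ORDERS constant, not the actual chardet language models or benchmark setup:

import random

# Hypothetical stand-ins for the structures timed above.
NUM_ORDERS = 64

# Old style: one flat tuple of 64 * 64 bigram scores, indexed with offset math.
lang_model_tuple = tuple(random.randint(0, 3) for _ in range(NUM_ORDERS * NUM_ORDERS))

# New style: nested dicts keyed by the two character orders.
lang_model_nested_dict = {
    i: {j: lang_model_tuple[(i * NUM_ORDERS) + j] for j in range(NUM_ORDERS)}
    for i in range(NUM_ORDERS)
}

# Random (first_order, second_order) pairs, like random_index_tuples above.
random_index_tuples = [(random.randrange(NUM_ORDERS), random.randrange(NUM_ORDERS))
                       for _ in range(1000)]

# Old lookup: offset math into the flat sequence.
old_scores = [lang_model_tuple[(i * NUM_ORDERS) + j] for i, j in random_index_tuples]

# New lookup: two plain dict lookups, no arithmetic.
new_scores = [lang_model_nested_dict[i][j] for i, j in random_index_tuples]

assert old_scores == new_scores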

The language model conversion script I've included in this PR does not need to stick around in master long term; I just wanted it here for review, since looking through the code that converts the language models and seeing if that looks right is much easier than visually comparing giant language model files.
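
The core of that conversion boils down to something like the following (a simplified, hypothetical sketch; the real script in this PR also has to handle the surrounding model metadata in each language model file):

def flat_model_to_nested(flat_model, num_orders=64):
    """Turn a flat sequence indexed by (i * num_orders) + j into a dict of dicts."""
    return {
        i: {j: flat_model[(i * num_orders) + j] for j in range(num_orders)}
        for i in range(num_orders)
    }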

I'm still seeing some test failures on this branch where Hungarian is being over-predicted, so this isn't quite ready to merge yet, but I figured that by putting it up here, someone might notice something I missed.

@dan-blanchard dan-blanchard changed the title from "Convert single-byte charset probers to use dicts of dicts for language models" to "Convert single-byte charset probers to use nested dicts for language models" on Apr 24, 2017
Member

@sigmavirus24 sigmavirus24 left a comment

So, the only performance benefit of using iteritems anywhere is when you have a dictionary with millions of item pairs. If that is where we are (GitHub won't show these diffs), then that's fine; otherwise, I'd rather we just use .items() everywhere. Either way, this looks great. 🎉 🍰 ✨
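
For context on the iteritems point (a sketch, not code from this PR): on Python 2, dict.items() builds a full list of pairs while dict.iteritems() returns a lazy iterator, so the difference only matters for very large dicts; on Python 3, iteritems() is gone and items() already returns a lightweight view.

scores = {'a': 1, 'b': 2}

try:
    pairs = scores.iteritems()  # Python 2 only: lazy iterator over pairs
except AttributeError:
    pairs = scores.items()      # Python 3: dict view, no full list is built

for key, value in pairs:
    print(key, value)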

@dan-blanchard dan-blanchard merged commit 6aeaeb4 into master Apr 27, 2017
@dan-blanchard dan-blanchard deleted the feature/new_style_sbcs_models branch April 27, 2017 17:44
dan-blanchard added a commit that referenced this pull request Jun 8, 2017
…models (#121)

* Convert single byte charset modules to use dicts of dicts for language modules

- Also provide conversion script

* Fix debug logging check

* Keep Hungarian commented out until we retrain