Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slightly speed up execution time of common cases #142

Merged
merged 6 commits into from Nov 15, 2021

Conversation

ofek
Copy link
Contributor

@ofek ofek commented Nov 14, 2021

I'm wrapping up re-writing a CLI tool where responsiveness is critical and during the switch from toml to tomli I noticed load time increased by ~1-2 milliseconds. This PR brings it down the equivalent amount.

If you don't like the singleton you can eventually use https://www.python.org/dev/peps/pep-0562/ but I haven't benchmarked that way.

@hukkin
Copy link
Owner

hukkin commented Nov 14, 2021

Thanks for the PR!

I believe this should mostly be helpful in CLI invocations where TOML isn't actually parsed (but import tomli still runs). The parsing of almost any real world TOML file will probably still run the code path where regexes are compiled. Do you agree?

If the assumption is correct, then maybe it is more reasonable that applications where this 1-2 milliseconds is critical make this optimization themselves by importing Tomli lazily, only when parsing actually happens?

If we still come to the conclusion that this optimization makes sense, what do you think about implementing this with something like

@lru_cache(mazsize=None)
def regex(name: str) -> "re.Pattern":
    if name == "datetime":
        ...
    elif name == "number":
        ...
    ...

in _re.py. And then

from tomli._re import regex
# Just an example
regex("datetime").match(src, pos)

in _parser.py

Then we'd not need the singleton and also slightly less dunder magic.

@hukkin
Copy link
Owner

hukkin commented Nov 14, 2021

Btw triggering GitHub actions via pre-commit.ci is absolutely genius 😄 Had not realized one could do that. Will start using that trick myself if I ever need it.

@ofek
Copy link
Contributor Author

ofek commented Nov 14, 2021

Hmm, good point. I hadn't read the code thoroughly: so the patterns are always used for every invocation of loads()?

@hukkin
Copy link
Owner

hukkin commented Nov 14, 2021

In the current state, if the TOML file contains a value assignment of a number (float, integer, hex, etc.), inline array, or inline table, then all regexes will be compiled. If these are not assigned but a datetime or localtime is, then some, but not all regexes will be compiled.

I assume there's not many TOML files where for example no integers or floats are assigned.

EDIT: my assumption may be wrong. E.g. there's no floats or integers in https://github.com/hukkin/tomli/blob/master/pyproject.toml There are inline arrays, but we could move priority of inline arrays in parse_value function to fix the issue with them

@ofek
Copy link
Contributor Author

ofek commented Nov 14, 2021

Ah okay thanks! Yes my pyproject.toml files also have no floats nor integers. This is what I'm seeing:

$ python -m timeit -n 1 -r 1 -s "from pathlib import Path;t=Path('pyproject.toml').read_text()" "import toml;toml.loads(t)"
1 loop, best of 1: 7.36 msec per loop
$ python -m timeit -n 1 -r 1 -s "from pathlib import Path;t=Path('pyproject.toml').read_text()" "import tomli;tomli.loads(t)"
1 loop, best of 1: 8.68 msec per loop

Co-Authored-By: Taneli Hukkinen <3275109+hukkin@users.noreply.github.com>
@ofek
Copy link
Contributor Author

ofek commented Nov 15, 2021

Lazy compilation of the regular expressions + your idea of re-ordering makes the first run as fast as toml even with numbers! I just pushed a commit.

edit: eh, it seems number regex needs to go after dates unfortunately

@ofek ofek changed the title Lazily compile regular expressions to speed up load time Speed up load time, and execution time of common cases Nov 15, 2021
@hukkin
Copy link
Owner

hukkin commented Nov 15, 2021

Great! Would you mind using the "lru_cache" implementation I described earlier? I'd much prefer it over "setattr/_getattr_/singleton" if there's no significant performance or other drawbacks.

No worries about bad commit history or force-pushes btw, I will do a squash when we merge this anyways.

@hukkin
Copy link
Owner

hukkin commented Nov 15, 2021

Also, would be great if you could confirm that this actually fixes the responsiveness issue of your CLI tool (a real world problem), and that the issue is not fixable by lazily importing Tomli.

This is already in a territory of micro-optimizations where we have to do trade-offs: we sacrifice some performance of parsing large TOML files and TOML files with numbers, to achieve better performance with small files without numbers. I don't think that this is that obvious trade-off to make, so would be much happier merging if at least we fix your real-world problem.

@ofek
Copy link
Contributor Author

ofek commented Nov 15, 2021

If there are no numbers then this fixes it, otherwise no. If only the number pattern wasn't greedy and could come before dates since dates aren't that common 😄

However, I think it would still be beneficial overall to incorporate your idea and move those 2 simple equality checks to before those 3 regex searches. Would you like me to keep that one change?

@ofek ofek changed the title Speed up load time, and execution time of common cases Slightly speed up execution time of common cases Nov 15, 2021
@hukkin hukkin merged commit 809a8ae into hukkin:master Nov 15, 2021
@hukkin
Copy link
Owner

hukkin commented Nov 15, 2021

Thanks!

@ofek ofek deleted the load-time branch November 15, 2021 14:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants