
Interning strings #139

Open
Changaco opened this issue May 31, 2022 · 5 comments

@Changaco
Contributor

Changaco commented May 31, 2022

It would be nice if CBORDecoder had an option to intern decoded strings, especially dict keys, since they're the most likely to appear multiple times both within the CBOR data and in the Python code that consumes it. The C version of the module could provide a small additional performance boost by calling PyUnicode_InternInPlace directly instead of the Python-level function.
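(For illustration, a minimal sketch of how a caller can get key interning today by post-processing the decoded object; the `intern_keys` helper below is purely illustrative and not part of cbor2's API.)

```python
# Illustrative only -- not cbor2 API: intern decoded dict keys by post-processing.
import sys

import cbor2

def intern_keys(obj):
    """Recursively replace str dict keys with their interned equivalents."""
    if isinstance(obj, dict):
        return {
            (sys.intern(k) if isinstance(k, str) else k): intern_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [intern_keys(item) for item in obj]
    return obj

payload = cbor2.dumps([{"name": "alice"}, {"name": "bob"}])
records = intern_keys(cbor2.loads(payload))
# Both dicts now share a single interned "name" key object.
print(list(records[0])[0] is list(records[1])[0])  # True
```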

@Sekenre
Collaborator

Sekenre commented Jun 1, 2022

That's a helpful idea, thanks!

@agronholm
Owner

How do you choose which strings get interned?

@Changaco
Contributor Author

Changaco commented Jun 2, 2022

I was thinking users would choose depending on the kind of data they're loading. Basically, CBORDecoder could have an intern option with four possible values:

  • 'all' would intern all strings;
  • 'keys' would intern all dict keys;
  • 'reuse' would reuse strings which are already interned (this would only work in the C version of the module, because as far as I know Python code can't check whether a string is interned without interning it);
  • 'none' would be the default and preserve current behaviour.
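
(For illustration, usage of the proposed option might look roughly like this; the intern= parameter is hypothetical and does not exist in cbor2 today.)

```python
# Hypothetical sketch only: the intern= parameter is the proposal above, not an existing cbor2 option.
from io import BytesIO

from cbor2 import CBORDecoder, dumps

payload = dumps([{"name": "alice", "age": 30}, {"name": "bob", "age": 31}])
decoder = CBORDecoder(BytesIO(payload), intern="keys")  # proposed: 'all' | 'keys' | 'reuse' | 'none'
records = decoder.decode()
```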

P.S. I have no immediate need for this feature and don't plan to work on it myself in the foreseeable future; I only opened this issue to share the idea.

@Sekenre
Collaborator

Sekenre commented Jun 2, 2022

I've been reading up on interning here: https://stackabuse.com/guide-to-string-interning-in-python/ and some of the disadvantages suggest it might slow down the decoder.

It depends very much on your data. If there are multiple dicts with duplicate keys and you need key lookups to be fast, then it makes sense to do this. Otherwise, storing dict keys in the interned-strings dictionary will just use more memory. It also doesn't help if your dict keys are another data type such as int or bytes.

I'll have to do some testing. Thanks for bringing this up, @Changaco; it was a good excuse for me to learn more about CPython internals.
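
(One rough way to gauge whether a given payload would benefit is to count how many duplicate key objects the decoder currently produces; this snippet is just illustrative.)

```python
# Illustrative check: how many distinct str objects back the dict keys of a decoded payload?
import cbor2

payload = cbor2.dumps([{"name": f"user{i}", "age": i} for i in range(10_000)])
records = cbor2.loads(payload)

keys = [k for record in records for k in record]
print(f"{len(keys)} key references, {len(set(map(id, keys)))} distinct str objects")
# A large gap between the two numbers suggests interning dict keys would reduce memory use.
```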

@Changaco
Contributor Author

Changaco commented Jun 3, 2022

The possible causes of a significant slowdown that I can think of are:

  • if the decoder tries to intern very long strings, because the hashing time is proportional to the string length, though this only amounts to ~65ms per 100 million characters on my laptop;
  • if the decoder tries to intern a very large number of strings, particularly if a lot of them end up in the same slot in the interned dict due to hash collisions.

Of course these caveats can be documented so that users can choose the best options for their use cases.
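
(A rough way to reproduce the hashing-cost estimate from the first bullet above; results will vary by machine.)

```python
# Rough timing of hashing a very long string; a str caches its hash after the first computation.
import time

s = "x" * 100_000_000  # 100 million characters
start = time.perf_counter()
hash(s)
print(f"first hash of a 100M-character string: {(time.perf_counter() - start) * 1000:.1f} ms")
```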
