
Interning strings #139

Open
Changaco opened this issue May 31, 2022 · 5 comments

@Changaco
Contributor

Changaco commented May 31, 2022

It would be nice if CBORDecoder had an option to intern decoded strings, especially dict keys, since they're the most likely to appear multiple times both within the CBOR data and in the Python code that consumes it. The C version of the module could provide a small additional performance boost by calling PyUnicode_InternInPlace directly instead of the Python-level function.
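(For illustration, a minimal sketch of how a caller can get key interning today by post-processing the decoded object; the `intern_keys` helper below is purely illustrative and not part of cbor2's API.)

```python
# Illustrative only -- not cbor2 API: intern decoded dict keys by post-processing.
import sys

import cbor2

def intern_keys(obj):
    """Recursively replace str dict keys with their interned equivalents."""
    if isinstance(obj, dict):
        return {
            (sys.intern(k) if isinstance(k, str) else k): intern_keys(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [intern_keys(item) for item in obj]
    return obj

payload = cbor2.dumps([{"name": "alice"}, {"name": "bob"}])
records = intern_keys(cbor2.loads(payload))
# Both dicts now share a single interned "name" key object.
print(list(records[0])[0] is list(records[1])[0])  # True
```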

@Sekenre
Collaborator

Sekenre commented Jun 1, 2022

That's a helpful idea, thanks!

@agronholm
Owner

How do you choose which strings get interned?

@Changaco
Contributor Author

Changaco commented Jun 2, 2022

I was thinking users would choose depending on the kind of data they're loading. Basically, CBORDecoder could have an intern option with four possible values:

  • 'all' would intern all strings;
  • 'keys' would intern all dict keys;
  • 'reuse' would reuse strings which are already interned (this would only work in the C version of the module, because as far as I know Python code can't check whether a string is interned without interning it);
  • 'none' would be the default and preserve current behaviour.
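
(For illustration, usage of the proposed option might look roughly like this; the intern= parameter is hypothetical and does not exist in cbor2 today.)

```python
# Hypothetical sketch only: the intern= parameter is the proposal above, not an existing cbor2 option.
from io import BytesIO

from cbor2 import CBORDecoder, dumps

payload = dumps([{"name": "alice", "age": 30}, {"name": "bob", "age": 31}])
decoder = CBORDecoder(BytesIO(payload), intern="keys")  # proposed: 'all' | 'keys' | 'reuse' | 'none'
records = decoder.decode()
```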

P.S. I have no immediate need for this feature and don't plan to work on it myself in the foreseeable future; I only opened this issue to share the idea.

@Sekenre
Collaborator

Sekenre commented Jun 2, 2022

I've been reading up on interning here: https://stackabuse.com/guide-to-string-interning-in-python/ and some of the disadvantages suggest it might slow down the decoder.

It depends very much on your data. If there are multiple dicts with duplicate keys and you need key lookups to be fast, then it makes sense to do this. Otherwise, storing dict keys in the interned-strings dictionary will just use more memory. It also doesn't help if your dict keys are another data type such as int or bytes.

I'll have to do some testing. Thanks for bringing this up, @Changaco; it was a good excuse for me to learn more about CPython internals.
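
(One rough way to gauge whether a given payload would benefit is to count how many duplicate key objects the decoder currently produces; this snippet is just illustrative.)

```python
# Illustrative check: how many distinct str objects back the dict keys of a decoded payload?
import cbor2

payload = cbor2.dumps([{"name": f"user{i}", "age": i} for i in range(10_000)])
records = cbor2.loads(payload)

keys = [k for record in records for k in record]
print(f"{len(keys)} key references, {len(set(map(id, keys)))} distinct str objects")
# A large gap between the two numbers suggests interning dict keys would reduce memory use.
```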

@Changaco
Contributor Author

Changaco commented Jun 3, 2022

The possible causes of a significant slowdown that I can think of are:

  • if the decoder tries to intern very long strings, because the hashing time is proportional to the string length, though this only amounts to ~65ms per 100 million characters on my laptop;
  • if the decoder tries to intern a very large number of strings, particularly if a lot of them end up in the same slot in the interned dict due to hash collisions.

Of course these caveats can be documented so that users can choose the best options for their use cases.
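
(A rough way to reproduce the hashing-cost estimate from the first bullet above; results will vary by machine.)

```python
# Rough timing of hashing a very long string; a str caches its hash after the first computation.
import time

s = "x" * 100_000_000  # 100 million characters
start = time.perf_counter()
hash(s)
print(f"first hash of a 100M-character string: {(time.perf_counter() - start) * 1000:.1f} ms")
```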
