Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FR: Change int repr on huge values to automatically use hexadecimal #96601

Closed
gpshead opened this issue Sep 5, 2022 · 5 comments
Closed

FR: Change int repr on huge values to automatically use hexadecimal #96601

gpshead opened this issue Sep 5, 2022 · 5 comments
Labels
3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@gpshead
Copy link
Member

gpshead commented Sep 5, 2022

Problem

Now that 95778 is in, the repr of an int can fail with a ValueError based on its size because repr and str are the same for int thus huge values cannot have a repr.

We discussed this while working on that security fix but deemed that changing a repr was way beyond reason for a patch release bugfix. Raising the ValueError exception highlights the point in the code that potentially needs specific attention rather than allowing a new unexpected format of data to start showing up where it hadn't previously as a result of a patch release.

Enhancement Proposal

We could fix this annoyance if we are willing to change int's repr. For huge values we could automatically repr them as hexadecimal. str behavior would not change.

The auto-hex repr point needs to be at less bits than required to represent a sys.int_info.str_digits_check_threshold decimal digit value so that there exists no scenario in which repr of an int could fail.

>>> int('1'+('0'*(sys.int_info.str_digits_check_threshold-1))).bit_count()
738
>>> int('1'+('0'*(sys.get_int_max_str_digits()-1))).bit_count()
4966

Perhaps all integers >512 bits (to pick an arbitrary nice threshold) could repr to hexadecimal:

>>> 2**511
6703903964971298549787012499102923063739682910296196688861780721860882015036773488400937149083451713845015929093243025426876941405973284973216824503042048
>>> 2**513
0x200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
>>> str(2**513)
'26815615859885194199148049996411692254958731641184786755447122887443528060147093953603748596333806855380063716372972101707507765623893139892867298012168192'

Effectively this behavior:

def proposed_int_repr(value: int):
    if value.bit_count() <= 512:
        return repr(value)
    else:
        return hex(value)

Potential wins

  1. We return to always being able to repr an int other than a MemoryError.
  2. Less hacky code is needed to see the actual value of an int when it is huge. Notebook users for example would see the result of their huge int computation instead of a ValueError. It'd just be in hex. (REPLs emit the repr)
    On the other hand, I expect notebooks may choose to implement this in their own REPL repr code long before it is released into a CPython version that they're run on top of.
  3. People don't need to check for int and implement their own specialized repr when they always want a value.
  4. Minor: People start using hexadecimal constants for huge values in code rather than decimal when they pasted them in from a REPL. Faster parsing, shorter code.

Potential disruption

  1. Golden value tests comparing string form data.
  2. Code inadvertently using the repr expecting to always get a decimal value. Bug in user code: Should use str.
  3. Stored reprs of data consumed at a distance by other code where it previously contained decimal values. Bug in user code: repr is not a data storage and transmission format.

If we didn't choose a low limit, but instead tied the switch over point to the largest binary value that fits within sys.get_int_str_max_digits() decimal digits we'd be inconsistent between environments or programs that choose to change their digits limit but would avoid emitting hexadecimal unless we had no other choice. This variant could be thought of as:

def digit_limit_tied_proposed_int_repr(value: int):
    try:
        return repr(value)
    except ValueError:
        return hex(value)
@gpshead gpshead added type-feature A feature request or enhancement interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.12 bugs and security fixes labels Sep 5, 2022
@gpshead
Copy link
Member Author

gpshead commented Sep 5, 2022

I assigned it to @mdickinson for thoughts, not necessarily to do it. This is potentially disruptive so it warrants discussion on Discuss first if we want to go forward with some version of this, if not ultimately a Steering Council decision if there is disagreement.

The unittests and documentation are more work than the implementation for the easy fixed transition point version.

The exact decimal threshold changeover point would be more work, it could involve keeping a PyLong around of the exact value to compare value > reference against for the hex transition point. Doing that is probably not worthwhile, just a rough estimate comparing size to the .bit_count() - 1 of the current decimal limit's maximum base 10 value or something similar would be close enough for that "avoid hex unless required purpose".

@mdickinson
Copy link
Member

mdickinson commented Sep 6, 2022

Some thoughts:

  • One of the joys of moving to Python 3 was that the repr and str of an int finally agreed (as did the repr and str of a float, post Python 3.1); it feels like a shame to lose that.
  • But I agree that having the repr at least work for large integers is better than raising, so on that basis alone I'm at least +0.1 on this proposal.
  • Both at work and at home, I have applications that use large integers but don't have the same security concerns that a user-facing web app has. I'd like those applications to be able to stick their heads in the sand and pretend that this change never happened. That's possible right now since the change can be disabled at various levels, but if we have a fixed transition point for the repr of an int (rather than something tied to the current limit) then it'll no longer be possible.
  • On the other hand, on general principles, having the repr of an int be dependent on context (so no longer being a pure function) is really icky.
  • If we do do this, and if we tie it to the current int_max_str_digits threshold rather than having a fixed threshold, I think we can and should make the transition point exactly equal to the point where str starts raising; that seems like it would provide the clearest user experience, and I think it could be implemented efficiently with at worst a very small handful of cases where we compute the decimal version of the string and then discard it in favour of the hex version.

Having offered those thoughts, I'll unassign myself. But I'd dearly love to hear from others, and especially those with close ties to Objects/longobject.c (@tim-one, @rhettinger, @serhiy-storchaka, @gvanrossum).

@mdickinson mdickinson removed their assignment Sep 6, 2022
@mdickinson
Copy link
Member

  1. Code inadvertently using the repr expecting to always get a decimal value. Bug in user code: Should use str.

Worth noting that the fact that the str of a collection generally uses repr for the collection items might make this non-trivial: there's no obvious spelling for "use str all the way down, please".

@serhiy-storchaka
Copy link
Member

I think that it is better to get exception right here, than produce an unexpected output which would break the same or other program when it tries to parse it.

Many years ago I considered the idea of using hexadecimals for long integers in pickle protocol 0. Not because potential DOS (pickles are already more unsafe), but for performance. But it would break compatibility with older Python versions and interoperability with other programs (there are implementations of pickle protocol 0 in non-Python programming languages). The only reason of using protocol 0 is compatibility.

@gpshead
Copy link
Member Author

gpshead commented Sep 27, 2022

Given those last two comments from Mark and Serhiy, I'm closing this one as "Infeasible" as reprs wind up in all sorts of places beyond people's control so doing this could just further surprises at a place removed from the code that produced the surprising data.

@gpshead gpshead closed this as not planned Won't fix, can't repro, duplicate, stale Sep 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.12 bugs and security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants