Support Python's Negative Exponent Padding Idiosyncrasy #620
Conversation
Why would you want that extra zero? To make the output the same as the standard library?
That is precisely the reason. There are a few scenarios in which this can be immensely useful; one of them is certain optimisations. For instance, fetching and comparing only the hashes of previously serialised JSON dumps, as opposed to fetching the dumps themselves and comparing them as deserialised Python objects. The time difference between the two can easily be orders of magnitude (especially when the dumps are particularly large). However, if the two implementations produce exactly the same output, then it does not matter which one produced it: the dumps can be reliably hashed and the resulting hashes compared (assuming that the JSON objects' keys are ordered, of course). I believe the way I introduced this new option would not affect the performance of … Do you have any objections against this?
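A minimal sketch of the optimisation described above. The function names (dump_digest, records_match, fetch_remote_obj) are illustrative and not part of this PR, and it assumes both sides serialise with the same implementation and a canonical key order:

import hashlib
import json

def dump_digest(obj) -> str:
    # sort_keys=True gives a canonical key order, so equal objects hash equally.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def records_match(local_obj, remote_digest, fetch_remote_obj) -> bool:
    # Cheap path: if the digests of the serialised dumps agree, skip the download.
    if dump_digest(local_obj) == remote_digest:
        return True
    # Expensive path: fetch the full dump and compare deserialised objects.
    return local_obj == fetch_remote_obj()

This only works when both producers emit byte-identical dumps, which is exactly why the exponent padding matters here.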
I'm just slightly spooked by the security implications. Anything that causes output strings to get longer can cause the rather fragile memory allocation model to underestimate what it needs and end up writing to out-of-bounds memory locations (e.g. CVE-2021-45958). In this case though, looking at the code, I see that … I can't say that building a system which hinges on two JSON implementations producing identical outputs sounds wise to me, but I'm OK with this new flag going in. |
The code has a fixed 128-byte buffer for the dconv output, and I left this comment when I fixed that CVE: Line 719 in 0d2d8c5
So it would be good to verify whether this worst case is still true with the padding. Although … |
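A hedged sanity check of how long padded output can get, using the stdlib json encoder (which already pads two-digit negative exponents) as a rough stand-in for dconv's shortest representation; this is an illustration, not a proof about double-conversion's actual worst case:

import json
import sys

# Floats chosen to exercise long mantissas and extreme exponents.
samples = [
    1e-5, -1e-5,
    1.2345678901234567e-5,
    -1.7976931348623157e308,   # largest-magnitude double
    5e-324,                    # smallest subnormal
    sys.float_info.min,
    sys.float_info.max,
]
print(max(len(json.dumps(x)) for x in samples))  # 24 for these samples, far below 128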
Couldn't agree more, and it doesn't. Like I said, this is an optimisation in a background integration system that needs to crunch a ridiculous amount of data periodically, so the implementation takes every opportunity to save time and effort. In this case it first checks whether the hashes are identical; if they are, no further, more expensive checks are required. If not, it needs to download gigabytes of data and compare them the proper way™. When it has to do this hundreds of thousands of times and cannot take advantage of the hashing, the time difference, like I said, is massive. (This problem alone, in isolation, could be solved differently, but there are other moving parts in this equation, e.g. keeping multiple systems alive in parallel, or the temporary inability to move away from a legacy system, etc.) Another problematic bit which I didn't mention but hinted at earlier is when the exact same data leaves the system, becomes user-facing, and arbitrary user integrations depend on it. (Again, whether they should rely on the unparsed, raw representation is a different question altogether, but as that does not happen within the system, one has no control over it.) Either way...
...this sounds quite problematic and serious; I'm going to have another look and add it to fuzzing as well, great shout. (Quite honestly though, I assumed that because the exponent's length is internally limited to the magic value of 5, my change could not cause any further trouble. And of course I also built on the assumption that if a feature is already implemented and exposed as part of double-conversion, then it is safe to hook into it and build on top of it.) |
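A rough illustration of the kind of brute-force check discussed here; the real fuzz case lives in ultrajson's test suite, and the keyword name follows this PR's description, so the exact spelling and call style are assumptions:

import json
import random

import ujson  # assumes a build of this branch with the proposed option

for _ in range(100_000):
    # Focus on the magnitude band where the padding matters (exponents around -9..-5).
    f = random.uniform(-1.0, 1.0) * 10.0 ** random.randint(-9, -5)
    # The assertion expresses the equality this PR aims for in that band.
    assert json.dumps(f) == ujson.dumps(f, zero_pad_negative_9_to_5_exponent=True)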
Hey, sorry it took me so long to get back to you, but I was extremely busy over the last two weeks. I added the case to the fuzz test, and It Works on My Machine™. I'm not exactly sure how indicative this brute-force fuzzing is, but at least it worked on my setup. I looked into how the buffer bounds checks work, i.e., as @JustAnotherArchivist hinted, whether dconv respects the buffer size passed down to it when … The only thing that made me worried is this comment: ultrajson/deps/double-conversion/double-conversion/double-to-string.cc, lines 259 to 261 in 45e00b3
Which is horrible, but AFAICT that is only used to validate the input string from which the substring is supposed to be sliced, so it has nothing to do with the buffer management. |
Codecov Report
All modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #620      +/-   ##
==========================================
+ Coverage   92.34%   92.37%   +0.02%
==========================================
  Files           6        6
  Lines        1908     1915       +7
==========================================
+ Hits         1762     1769       +7
  Misses        146      146

☔ View full report in Codecov by Sentry. |
That reminds me actually, I did try temporarily dropping that 128 to 15, which got me to a … On the flip side, I guess that if there is a hypothetical float input that can cause this to go wrong when … |
Where did you change that? On L638 of … |
That sounds plausible to me, but at the same time it is too late for me to think it through properly. TBVH I think it would be great if someone could rework at least the … |
Yep:

diff --git i/lib/ultrajsonenc.c w/lib/ultrajsonenc.c
index 9ec2faf..7cc36bf 100644
--- i/lib/ultrajsonenc.c
+++ w/lib/ultrajsonenc.c
@@ -635,7 +635,7 @@ static void Buffer_AppendUnsignedLongUnchecked(JSONObjectEncoder *enc, JSUINT64
 static int Buffer_AppendDoubleDconv(JSOBJ obj, JSONObjectEncoder *enc, double value)
 {
-  char buf[128];
+  char buf[15];
   int strlength;
 #ifdef DEBUG
   if ((size_t) (enc->end - enc->offset) < sizeof(buf)) {
Yeah. I do sometimes consider trying to rework some of it, but then I remember that orjson supersedes ujson in almost every way and decide that the best fix is to just encourage users to switch. As much as Rust's static-linking, build-all-dependencies-from-source culture makes the UNIX enthusiast in me want to puke, it is a lot better at mashing together strings of non-predetermined size without writing to out-of-bounds memory. |
In that case I have no answers for you. In fact, I would be worried, not just because of my changes, but because this entire thing is not doing what it is supposed to do based on how it is implemented. If we are really scared about this, then I might suggest increasing the size of that array on the stack. The next power of two would be too great, but taking into account that the maximum exponent size is hard-coded, and considering all the other inlined constants there, we might be able to make it big enough that it would never fail. That being said, if the size is really not respected further down the line, then what use would any of this have?
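For illustration, a back-of-the-envelope bound along the lines suggested above; every count here is an assumption about the textual shape of a shortest-repr double, not a statement about dconv's actual configuration in ujson:

sign            = 1   # optional leading '-'
digits          = 17  # maximum significant digits for a double
decimal_point   = 1   # '.'
exponent_marker = 2   # 'e' plus the exponent's sign
exponent_digits = 3   # |exponent| <= 308; the padded -9..-5 forms use only 2
nul_terminator  = 1   # trailing '\0' in the C buffer

print(sign + digits + decimal_point + exponent_marker
      + exponent_digits + nul_terminator)  # 25 -- comfortably below the current 128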
I'm aware of all this, but (1) that doesn't mean it is good, and (2) it doesn't mean that it is good practice in this day and age either. I do not wish to start a debate here on the ideas of a different era versus the direction of newer tools (you referred to common *NIX practices and Rust ones), even though it would certainly make an excellent discussion in a pub over a couple of pints.
That is actually an excellent shout, and it is hinted at by the benchmarks in the README as well. Of course, that raises the question: why is this project still actively maintained and not archived? |
I'd be for pulling the plug and putting in a deprecation. Not sure what the other maintainers think.
This still has a lot of downloads (10 million/month!) so I think it's worth maintaining, but I think we could say no more new features.
From the conversation on #620, let's try to steer people towards orjson and away from writing pull requests that we're too paranoid to accept.
Changes proposed in this pull request:
- A new option, zero_pad_negative_9_to_5_exponent, which if enabled replicates the 0-padding behaviour of the json module in Python's standard library. That is, if the exponent is between -9 and -5 (both inclusive), json pads the exponent with a leading 0, e.g. 1e-5 becomes 1e-05. This compatibility option is set to False by default, which means it is fully backward compatible.
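A usage sketch based on the description above; the keyword name follows the PR text, and passing it to dumps this way is an assumption about the final API:

import json
import ujson  # assumes a build that includes this PR

x = 1e-5
print(json.dumps(x))    # '1e-05'  (stdlib pads the two-digit negative exponent)
print(ujson.dumps(x))   # '1e-5'   (current ujson behaviour, no padding)

# With the proposed option enabled, ujson should match the stdlib:
print(ujson.dumps(x, zero_pad_negative_9_to_5_exponent=True))  # '1e-05'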