
Support Python's Negative Exponent Padding Idiosyncrasy #620

Open · wants to merge 1 commit into main

Conversation

@peter-varo-hx commented Jan 24, 2024

Changes proposed in this pull request:

  • Introduce a new option called zero_pad_negative_9_to_5_exponent which, when enabled, replicates the zero-padding behaviour of the json module in Python's standard library. That is, if the exponent is between -9 and -5 (both inclusive), json pads it with a leading 0, e.g. 1e-5 becomes 1e-05. The option defaults to False, so the change is fully backward compatible. (See the comparison sketch below.)
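
For illustration, a quick comparison of the two behaviours; the keyword-argument call is hypothetical until this branch is merged, and the current unpadded output is taken from the PR description:

import json
# ujson here refers to the released package, without this PR's option.
import ujson

value = 1e-05

# Python's float repr zero-pads single-digit exponents, and json.dumps inherits that:
print(json.dumps(value))   # '1e-05'

# ujson currently emits double-conversion's shortest form (per the PR description):
print(ujson.dumps(value))  # '1e-5'

# With the proposed option the outputs would match (only available on this branch):
# ujson.dumps(value, zero_pad_negative_9_to_5_exponent=True)  # '1e-05'

Only the -9 to -5 range is affected because Python prints values with larger exponents in positional notation, and exponents of -10 and below already contain two digits, so they need no padding.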

@bwoodsend (Collaborator) commented Jan 25, 2024

Why would you want that extra zero? To make the output the same as the standard library json.dumps() I guess?

@peter-varo-hx (Author) commented Jan 25, 2024

[...] To make the output the same as the standard library [...]?

That is precisely the reason. There are a few scenarios where this can be immensely useful; one of them is optimisation. For instance, fetching and comparing only the hashes of previously serialised JSON dumps, as opposed to fetching the dumps themselves and comparing them as deserialised Python objects. The time difference between the two approaches can easily be orders of magnitude (especially when the dumps are particularly large). However, if the two implementations produce exactly the same output, then it does not matter which one produced it: the dumps can be reliably hashed and the resulting hashes compared (assuming the JSON objects' keys are ordered, of course).
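
A minimal sketch of that hash-first comparison, using only the standard library. The helper name and sample data are made up for illustration, and the digests are only comparable if every producer serialises floats (and key order) identically, which is exactly what this option is meant to enable:

import hashlib
import json

def dump_digest(obj) -> str:
    # Keys must be serialised in a fixed order for the digests to be comparable.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()

previous = {"b": 1e-05, "a": [1, 2, 3]}   # e.g. the snapshot serialised last run
current = {"a": [1, 2, 3], "b": 1e-05}    # e.g. what the system holds now

if dump_digest(previous) == dump_digest(current):
    print("identical: skip the expensive fetch-and-deep-compare step")
else:
    print("changed: download the full dumps and compare them properly")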

I believe the way I introduced this new option does not affect the performance of ujson in any way. When the option is disabled (the default), it only adds a single if-check per dump / dumps invocation at the C level; when enabled, it adds, in the worst case, one additional while-loop per double serialisation.

Do you have any objections against this?

@bwoodsend (Collaborator)

I'm just slightly spooked by the security implications. Anything that causes output strings to get longer can cause the rather fragile memory allocation model to underestimate what it needs and end up writing to out-of-bounds memory locations (e.g. CVE-2021-45958). In this case though, looking at the code, I see that dconv_d2s() is doing all the difficult stuff, and we know the length of the stringified decimal before we have to add it to the big JSON output string, so there's no upper-bound estimation involved. Probably still worth putting it in the fuzz test (you can use one of the other boolean switches as a template).
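
For what it's worth, a standalone brute-force sketch of the kind of case that could be fuzzed; the repository's real fuzz harness is structured differently, and this assumes the branch from this PR is installed, since the keyword does not exist in released ujson:

import json
import math
import random
import struct

import ujson  # assumes this PR's branch; the keyword is not in released ujson

for _ in range(100_000):
    # Reinterpret 64 random bits as a double to sweep the whole value space.
    value = struct.unpack("<d", random.randbytes(8))[0]
    if math.isnan(value) or math.isinf(value):
        continue  # not representable in strict JSON
    padded = ujson.dumps(value, zero_pad_negative_9_to_5_exponent=True)
    # The padded output must still parse back to the exact same double;
    # the main point of fuzzing it is to shake out any buffer overrun.
    assert json.loads(padded) == value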

I can't say that building a system which hinges on two JSON implementations producing identical outputs sounds wise to me but I'm OK with this new flag going in.

@bwoodsend added the "changelog: Added" label (for new features) on Jan 26, 2024
@JustAnotherArchivist (Collaborator) commented Jan 26, 2024

In this case though, looking at the code, I see that dconv_d2s() is doing all the difficult stuff, and we know the length of the stringified decimal before we have to add it to the big JSON output string, so there's no upper-bound estimation involved.

The code has a fixed 128-byte buffer for the dconv output, and I left this comment when I fixed that CVE:

The length of 128 is the worst case length of the Buffer_AppendDoubleDconv addition.

So it would be good to verify whether this worst case is still true with the padding.

Although Buffer_AppendDoubleDconv does pass the size of the buffer to dconv, and I would hope that dconv respects that limit.
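
Not a proof, but a rough back-of-envelope check is possible from Python alone: the standard library's output already carries the padding zero, so its length is a reasonable proxy for what the padded ujson output would be:

import json
import math
import random
import struct

longest = 0
for _ in range(200_000):
    value = struct.unpack("<d", random.randbytes(8))[0]
    if math.isnan(value) or math.isinf(value):
        continue
    # json.dumps already zero-pads the exponent, so its length roughly tracks
    # what ujson would emit with the new option enabled.
    longest = max(longest, len(json.dumps(value)))

print(longest)  # stays in the mid-twenties, comfortably below the 128-byte buffer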

@peter-varo-hx (Author) commented Jan 26, 2024

[...] building a system which hinges on two JSON implementations producing identical outputs [does not] sound wise to me [...]

Couldn't agree more, and it doesn't. Like I said, this is an optimisation in a background integration system that needs to crunch a ridiculous amount of data periodically, so the implementation takes every opportunity to save time and effort. In this case, it first checks whether the hashes are identical; if they are, no further, more expensive checks are required. If they are not, it needs to download gigabytes of data and compare them the proper way™. When it has to do this hundreds of thousands of times and cannot take advantage of the hashing, the time difference, like I said, is massive. (This problem alone, in isolation, could be solved differently, but there are other moving parts in the equation, e.g. keeping multiple systems alive in parallel, or the temporary inability to move away from a legacy system.) Another problematic bit, which I didn't mention but hinted at earlier, is when the exact same data leaves the system, becomes user-facing, and arbitrary user integrations depend on it. (Again, whether those should rely on the unparsed, raw representation is a different question altogether, but as that does not happen within the system, one has no control over it.)

Either way...

Anything that causes output strings to get longer can cause the rather fragile memory allocation model to underestimate what it needs and end up writing to out-of-bounds memory locations (e.g. CVE-2021-45958).

...this sounds quite problematic and serious; I'm going to have another look and add it to the fuzz test as well, great shout. (Quite honestly though, I assumed that because the exponent's length is internally limited to the magic value of 5, my change could not cause any further trouble. I also built on the assumption that if a feature is already implemented and exposed as part of double-conversion, then it is safe to hook into it and build on top of it.)

@peter-varo-hx (Author)

Hey, sorry it took me so long to get back to you, but I was extremely busy in the last two weeks.

I added the case to the fuzz test, and It Works on My Machine™. I'm not exactly sure how indicative this brute-force fuzzing is, but at least it worked on my setup.

I looked into how the buffer bounds checks work, i.e., as @JustAnotherArchivist hinted, whether dconv respects the buffer size passed down to it when Buffer_AppendDoubleDconv is called. As far as I can tell (though my experience is in Rust and C, not C++, so take my words with a pinch of salt), the buffer, which is explicitly created with size 128 (N.B. this magic number is duplicated and can go out of sync with other places), is the one used to create the StringBuilder, which internally checks the buffer's length (a Vector) on all kinds of additions (e.g. AddCharacter, AddString, AddSubstring, etc.), so I believe we are on the safe side here.

The only thing that made me worried is this comment:

// Problem: there is an assert in StringBuilder::AddSubstring() that
// will pass this buffer to strlen(), and this buffer is not generally
// null-terminated.

Which is horrible, but AFAICT that is only used to validate the input string from which the substring is supposed to be sliced, so it has nothing to do with the buffer management.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (0d2d8c5) 92.34% vs. head (8d4e4c7) 92.37%.
Report is 1 commit behind head on main.

❗ The current head 8d4e4c7 differs from the pull request's most recent head 6491c18. Consider uploading reports for commit 6491c18 to get more accurate results.


Additional details and impacted files
@@            Coverage Diff             @@
##             main     #620      +/-   ##
==========================================
+ Coverage   92.34%   92.37%   +0.02%     
==========================================
  Files           6        6              
  Lines        1908     1915       +7     
==========================================
+ Hits         1762     1769       +7     
  Misses        146      146              


@bwoodsend (Collaborator)

That reminds me, actually: I did try temporarily dropping that 128 to 15, which got me a SIGABRT: *** stack smashing detected ***: terminated. Unless I'm doing something stupid here, that seems to imply that dconv doesn't have our backs?

On the flip side, I guess that if there is a hypothetical float input that can cause this to go wrong when zero_pad_negative_9_to_5_exponent is set, wouldn't that imply that you could multiply that float by 1e-10 to get a number of the same length which would blow up in the same way without zero_pad_negative_9_to_5_exponent?
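
Concretely (using the standard library's already-padded output to stand in for the padded form):

import json

padded = json.dumps(1.2345e-05)   # '1.2345e-05': exponent in the -9..-5 range, padded
shifted = json.dumps(1.2345e-15)  # '1.2345e-15': same length with no padding at all

# Any length reachable through the padding is already reachable without it,
# so the option should not introduce a new worst case.
assert len(padded) == len(shifted)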

@peter-varo-hx (Author)

Unless I'm doing something stupid here, that seems to imply that dconv doesn't have our backs?

Where did you change that? On L638 of ultrajsonenc.c? Because if you did it somewhere else, I wouldn't be surprised if all kinds of weird behaviour happened. But there we pass sizeof(buf) explicitly to dconv_d2s, which is then passed explicitly on L26 in dconv_wrapper.cc to StringBuilder's constructor, which, if I'm not mistaken, creates the Vector with that buffer_size on L301 in utils.h in double-conversion. After that, buffer_.length() is used in every addition method for bounds checking, and it consistently returns the passed-down value, as you can see on L266 in utils.h. (That whole approach of not differentiating between capacity and length, etc. is only asking for trouble, but it is what it is. I don't quite understand why utils.h reimplements Vector instead of using std::vector or some other array-backed implementation to begin with.) So unless there's a massive bug somewhere along this chain, I think it is impossible to write out of bounds going down this route.

On the flip side, I guess that if there is a hypothetical float input that can cause this to go wrong when zero_pad_negative_9_to_5_exponent is set, wouldn't that imply that you could multiply that float by 1e-10 to get a number of the same length which would blow up in the same way without zero_pad_negative_9_to_5_exponent?

That sounds plausible to me, but at the same time it is too late for me to think it through properly.

TBVH I think it would be great if someone could rework at least the StringBuilder-related parts of this entire code base. It baffled me how fragile this entire thing is, not only based on your comments but also from digging deeper into it myself. It feels like it could blow up any time someone tries something we haven't done before, like my MR, by using functionality that is part of the code base but has never been exercised until now.

@bwoodsend (Collaborator)

Where did you change that? On L638 of ultrajsonenc.c?

Yep

diff --git i/lib/ultrajsonenc.c w/lib/ultrajsonenc.c
index 9ec2faf..7cc36bf 100644
--- i/lib/ultrajsonenc.c
+++ w/lib/ultrajsonenc.c
@@ -635,7 +635,7 @@ static void Buffer_AppendUnsignedLongUnchecked(JSONObjectEncoder *enc, JSUINT64
 
 static int Buffer_AppendDoubleDconv(JSOBJ obj, JSONObjectEncoder *enc, double value)
 {
-  char buf[128];
+  char buf[15];
   int strlength;
 #ifdef DEBUG
   if ((size_t) (enc->end - enc->offset) < sizeof(buf)) {

TBVH I think it would be great if someone could rework at least the StringBuilder-related parts of this entire code base.

double-conversion isn't our code – it's a fairly standard UNIX package/C++ library. If we edited it at all, we'd need to be careful since some downstream Linux packagers delete the bundled double-conversion in favour of linking against the system provided libdouble-conversion.so.

It baffled me how fragile this entire thing is,

Yeah. I do sometimes consider trying to rework some of it, but then I remember that orjson supersedes ujson in almost every way and decide that the best fix is to just encourage users to switch. As much as Rust's static-linking, build-all-dependencies-from-source culture makes the UNIX enthusiast in me want to puke, it is a lot better at mashing together strings of non-predetermined size without writing to out-of-bounds memory.

@peter-varo-hx (Author)

Yep

In that case I have no answer for you. In fact, I would be worried, not just because of my changes, but because this whole thing is not doing what its implementation suggests it should. If we are really scared about this, then I might suggest increasing the size of that array on the stack. The next power of two would be excessive, but taking into account that the maximum exponent size is hard-coded, and considering all the other inlined constants there, we might be able to make it big enough that it can never fail. That being said, if the size really is not respected further down the line, then what use would any of this be?

double-conversion isn't our code – it's a fairly standard UNIX package/C++ library. If we edited it at all, we'd need to be careful since some downstream Linux packagers delete the bundled double-conversion in favour of linking against the system provided libdouble-conversion.so.

I'm aware of all this, but 1. that doesn't mean it is good, and 2. it doesn't mean it is good practice in this day and age either. I do not wish to start a debate here about the ideas of a different era versus the directions of newer tools (as you contrasted common *NIX practices with Rust ones), even though it would certainly make for an excellent discussion in a pub over a couple of pints.

[...] then I remember that orjson supersedes ujson in almost every way and decide that the best fix is to just encourage users to switch.

That is actually an excellent shout, and it is hinted at by the benchmarks in the README as well. Of course, that raises the question: why is this project still actively maintained and not archived?

@bwoodsend (Collaborator)

Of course, that raises the question: why is this project still actively maintained and not archived?

I'd be for pulling the plug and putting in a deprecation notice. Not sure what the other maintainers think.

@hugovk (Member) commented Feb 22, 2024

This still has a lot of downloads (10 million/month!), so I think it's worth maintaining, but I think we could say no more new features.

@bwoodsend added a commit that referenced this pull request on May 11, 2024:
From the conversation on #620, let's try to steer people towards orjson
and away from writing pull requests that we're too paranoid to accept.