Allow str and None values for indent #518
base: main
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main     #518      +/-   ##
==========================================
+ Coverage   90.63%   90.73%   +0.10%
==========================================
  Files           6        6
  Lines        1783     1835      +52
==========================================
+ Hits         1616     1665      +49
- Misses        167      170       +3
```

Continue to review full report at Codecov.
@JustAnotherArchivist The latest commit removes the ValueError on a non-pure-space indent and allows a completely custom UTF-8 indent for full compatibility with Python's json. I've verified that it works, but I am not a C expert, so please double-check and critique my logic and implementation of this feature. You mentioned in the corresponding issue that it requires proper memory management, but IIUC …
Mind rebasing this to include #519 (I imagine there will be merge conflicts)? Then we can dig around for potential memory issues.
@bwoodsend I just turned this setting on in the repo, which gives us these buttons, and it says there are no conflicts. So if you like, you can rebase it right here :)
To clarify on the memory management: I wasn't talking about the Python layer but about the buffer-overflow security issue that was just fixed in #519. The way it's implemented here, I think it shouldn't reintroduce any security issues. The buffer reservations use `enc->indent` for the size calculation, and that's set correctly for the string case, I think. (The optional newline is already always accounted for.) So that seems fine.

Some further comments below.
```c
  }

  *_outLen = PyBytes_Size(newObj);
  return PyBytes_AsString(newObj);
```
As you mentioned, `PyUnicode_AsUTF8AndSize` and `PyBytes_AsString` don't return new references that need releasing. But as far as I can tell, `PyUnicode_AsUTF8String` does. So there'd need to be a `Py_DECREF(newObj)` call between `PyBytes_AsString` and returning, I think.
I'll look more into that. Also, regarding this code: is there any better way of implementing the `_PyUnicodeToChars` logic? I took this logic from elsewhere in the file that was doing something similar, but not identical. I'm curious what best practices are here.
It seems like this is indeed the way to do it. Looks like the other similar code keeps a reference to `newObj` and then later calls `Py_XDECREF` on it, by the way.

Also, sidenote: `PyUnicode_AsUTF8AndSize` is part of the stable ABI since Python 3.10, so once that's the minimum supported version, only the code in the `#ifndef` will be needed.
I just looked at this again as I'm trying to fix the issues with surrogates. `PyBytes_AsString` returns a pointer to the internal `char*` buffer. This means the `PyObject` can't be `DECREF`ed until the buffer is no longer needed. This is why the similar existing function stores the `newObj` in the context struct.

It should be possible to refactor `PyUnicodeToUTF8` slightly to make it reusable for this (and it'll be useful for other things as well, e.g. `separators`). Then the `PyObject` would be kept and `DECREF`ed at the end of `objToJSON`. I'll incorporate that refactor in my surrogate fix since it requires changes to that function anyway.
→ #530
The fix for the surrogates unblocked me in terms of what I could do in this PR. Using `PyUnicode_AsEncodedString` was the key. I rebased on your branch so I could get those changes, and the tests are now at least working locally. Let's see how they do on the CI. This PR probably needs a bit of cleanup because I did a bunch of print debugging.
You can use my `PyUnicodeToUTF8Raw`, no need for the duplicate function. You will need to take care of `DECREF`ing the `PyBytes` object at the end though, as mentioned above (also in case of `objToJSON` returning early!).
```python
output0c = ujson.encode(data, indent="")
output0d = ujson.encode(data, indent=0)
```
These (and also negative ints) should actually produce a different output than the default and `indent=None`, namely inserting newlines but no indentation. However, that bug has been present for longer (#317), and if you don't want to tackle it, it could be done at a later time.
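For comparison, this is the stdlib `json` behavior being described: a zero or empty indent still breaks lines, it just indents them with nothing.

```python
import json

data = {"a": [1, 2]}

# indent=None (the default) gives compact, single-line output:
print(json.dumps(data))            # {"a": [1, 2]}

# indent=0 (or indent="") still inserts newlines, with no indentation:
print(json.dumps(data, indent=0))  # {
                                   # "a": [
                                   # 1,
                                   # 2
                                   # ]
                                   # }
```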
lib/ultrajsonenc.c
Outdated
```c
  for (i = 0; i < enc->indent; i++)
    Buffer_AppendCharUnchecked(enc, enc->indent_chars[i]);
```
For performance, this should probably be replaced with a `memcpy`, though an extra check in debug mode that this doesn't overrun the buffer would be a good idea. Something like this, I think:
```diff
-    for (i = 0; i < enc->indent; i++)
-      Buffer_AppendCharUnchecked(enc, enc->indent_chars[i]);
+#ifdef DEBUG
+    if (enc->end - enc->offset < enc->indent)
+    {
+      fprintf(stderr, "Ran out of buffer space during Buffer_AppendIndentUnchecked()\n");
+      abort();
+    }
+#endif
+    memcpy(enc->offset, enc->indent_chars, enc->indent);
+    enc->offset += enc->indent;
```
(Come to think of it, such checks would be a good idea on the other `memcpy`s as well, but that's for another PR.)
A test for this that's basically a merger of `test_dump_huge_indent` and `test_dump_long_string` would be good as well: a similar variation, arranged such that it hits a memory boundary. It's too late right now for me to think this through fully though.
I forgot that of course `enc->offset` needs to be incremented by `enc->indent` after the copy as well. Fixed that now.
#529 adds a `Buffer_memcpy` function that could be used instead once merged.
Issues I'm trying to resolve: `indentChars` can now contain NUL chars and be considered valid, so I had to change a `PyUnicode_FromString` to `PyUnicode_FromStringAndSize` and keep track of the size. I haven't quite gotten that right yet, and I'm still having an issue with it. I'm still working on this, albeit sparsely. If someone knows how to rework this to address corner cases, please go for it; I probably won't look at this again for at least a week.
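The `PyUnicode_FromString` vs. `PyUnicode_FromStringAndSize` distinction can be illustrated from Python with `ctypes` (a sketch of the underlying C-string semantics, not the actual extension code): a NUL-terminated read stops at the first `\0`, while a size-aware read preserves embedded NULs.

```python
import ctypes

buf = b"an\x00indent"  # an indent string containing an embedded NUL byte

# c_char_p mimics PyUnicode_FromString: it reads until the first NUL,
# silently truncating the rest.
print(ctypes.c_char_p(buf).value)          # b'an'

# string_at with an explicit size mimics PyUnicode_FromStringAndSize:
# the full buffer, embedded NUL included, survives.
print(ctypes.string_at(buf, len(buf)))     # b'an\x00indent'
```

This is why the size has to be tracked alongside the pointer once NUL bytes become valid indent characters.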
I commented on this before, but it's in a resolved comment, so I'm repeating (and expanding) it for visibility: I'm not sure the surrogates or other weird characters in the indent string need to be supported at all. On the other hand, there are issues with surrogates in values as well (#156, #447), so we need something that can carry those characters through.
Force-pushed from 6d8ecf1 to 6861525 (compare)
The tests are failing now due to the negative …
That makes sense. I see the issue now. As a quick fix, I defined a max function and used it …
Did a bit more work on this. I took out all of the print debugging. In a previous commit, there was a behavior change where extra spaces were added after separator chars; this is now reverted (that actually matched Python's output better, but it wastes space, so I'm not sure which is preferable). I've cleaned up the script I've been using to check compatibility between the two libraries. I think there is still an outstanding issue where I'm forgetting a `DECREF` somewhere.
- [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
- Fix compiler warnings
- Update python/objToJSON.c (Co-authored-by: JustAnotherArchivist <JustAnotherArchivist@users.noreply.github.com>)
- camelCase whitespace
- Update tests/test_ujson.py (Co-authored-by: JustAnotherArchivist <JustAnotherArchivist@users.noreply.github.com>)
@Erotemic Are you still interested in working on this?
I would still very much like the feature, but if anyone else wanted to continue (or entirely replace) my work, I wouldn't mind.
Fixes #517
Changes proposed in this pull request:
Allows `indent` to be specified as `None`, an integer, or a string, as long as the string contains only spaces. This increases the API compatibility between `ujson` and Python's `json`, making it easier to use `ujson` as a drop-in replacement.
As this is just one check at the start of the call, I don't expect any serious performance hit.

This also includes corresponding tests to ensure that the `str` and integer ways of calling `ujson` are equivalent.
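The equivalence being targeted mirrors the stdlib `json` module, where an integer indent of n behaves exactly like a string of n spaces (shown here with `json.dumps` as the reference, since released `ujson` may not yet accept string indents):

```python
import json

data = {"a": 1, "b": [1, 2]}

# An int indent of n is equivalent to a string of n spaces:
assert json.dumps(data, indent=2) == json.dumps(data, indent="  ")

# indent=None (the default) keeps everything compact on one line:
assert "\n" not in json.dumps(data, indent=None)
```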