Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow str and None values for indent #518

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

Erotemic
Copy link
Contributor

@Erotemic Erotemic commented Apr 5, 2022

Fixes #517

Changes proposed in this pull request:

Allows indent to be specified as either None, an integer, or a string as long as it only contains spaces. This increases the API compatibility of ujson and Python's json, making it easier to work with ujson as a drop-in replacement.

As this is just one check at the start of the call, I don't expect that this would have any serious performance hit.

This also includes corresponding tests to ensure the str and integer way of calling ujson are equivalent.

@codecov-commenter
Copy link

codecov-commenter commented Apr 5, 2022

Codecov Report

Merging #518 (7859299) into main (f6860f1) will increase coverage by 0.10%.
The diff coverage is 94.73%.

❗ Current head 7859299 differs from pull request most recent head b7c4134. Consider uploading reports for the commit b7c4134 to get more accurate results

@@            Coverage Diff             @@
##             main     #518      +/-   ##
==========================================
+ Coverage   90.63%   90.73%   +0.10%     
==========================================
  Files           6        6              
  Lines        1783     1835      +52     
==========================================
+ Hits         1616     1665      +49     
- Misses        167      170       +3     
Impacted Files Coverage Δ
python/objToJSON.c 87.09% <88.88%> (+0.05%) ⬆️
lib/ultrajsonenc.c 85.12% <100.00%> (+0.15%) ⬆️
tests/test_ujson.py 99.60% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6860f1...b7c4134. Read the comment docs.

@Erotemic
Copy link
Contributor Author

Erotemic commented Apr 5, 2022

@JustAnotherArchivist The latest commit removes the ValueError on an non-pure space indent, and allows a completely custom UTF8 indent for full compatibility with Python's json.

I've verified that it works, but I am not a C expert, so please double check and critique my logic and implementation of this feature. You mentioned that it requires proper memory management in the corresponding issue, but IIUC, PyUnicode_AsUTF8AndSize and PyBytes_AsString do not require the user to deallocate the buffer, so I'm not sure what other memory management is needed.

@bwoodsend
Copy link
Collaborator

Mind rebasing this so as to include #519 (I imagine that there will be merge conflicts)? Then we can dig around for potential memory issues.

@hugovk
Copy link
Member

hugovk commented Apr 5, 2022

@bwoodsend I just turned this setting on in the repo:

image

Which gives us these buttons:

image

And it says there are no conflicts. So if you like, you can rebase it right here :)

Copy link
Collaborator

@JustAnotherArchivist JustAnotherArchivist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify on the memory management, I wasn't talking about the Python layer but the buffer overflow security issue that was just fixed in #519. The way it's implemented here, I think it shouldn't reintroduce any security issues. The buffer reservations use enc->indent for the size calculation, and that's set correctly for the string case, I think. (The optional newline is already always accounted for.) So that seems fine.

Some further comments below.

lib/ultrajson.h Outdated Show resolved Hide resolved
lib/ultrajson.h Outdated Show resolved Hide resolved
lib/ultrajsonenc.c Outdated Show resolved Hide resolved
}

*_outLen = PyBytes_Size(newObj);
return PyBytes_AsString(newObj);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned, PyUnicode_AsUTF8AndSize and PyBytes_AsString don't require dereferencing. But as far as I can tell, PyUnicode_AsUTF8String does. So there'd need to be a Py_DECREF(newObj) call between PyBytes_AsString and returning, I think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look more into that. Also wrt to this code, is there any better way of implementing the _PyUnicodeToChars logic? I took this logic from elsewhere in the file that was doing a similar, but not identical thing. I'm curious what best practices are here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this is indeed the way to do it. Looks like the other similar code keeps a reference to newObj and then later calls Py_XDECREF on it, by the way.

Also, sidenote: PyUnicode_AsUTF8AndSize is part of the stable ABI since Python 3.10, so once that's the minimum supported version, only the code in the #ifndef will be needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just looked at this again as I'm trying to fix the issues with surrogates. PyBytes_AsString returns a pointer to the internal char* buffer. This means the PyObject can't be DECREFd until the buffer is no longer needed. This is why the similar existing function stores the newObj in the context struct.

It should be possible to refactor PyUnicodeToUTF8 slightly to make it reusable for this (and it'll be useful for other things as well, e.g. separators). Then the PyObject would be kept and DECREFd at the end of objToJSON. I'll incorporate that refactor in my surrogate fix since it requires changes to that function anyway.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#530

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix for the surrogates unblocked me in terms of what I could do in this PR. Using PyUnicode_AsEncodedString was the key. I rebased on your branch so I could get those changes, and the tests are now at least working locally. Let's see how they do on the CI. This PR probably need a bit of cleanup because I did a bunch of print debugging.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use my PyUnicodeToUTF8Raw, no need for the duplicate function. You will need to take care of DECREFing the PyBytes object at the end though as mentioned above (also in case of objToJSON returning early!).

python/objToJSON.c Outdated Show resolved Hide resolved
Comment on lines +284 to +288
output0c = ujson.encode(data, indent="")
output0d = ujson.encode(data, indent=0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These (and also a negative ints) should actually produce a different output than default and indent=None, namely inserting newlines but no indentation. However, that bug has been present for longer (#317), and if you don't want to tackle it, it could be done at a later time.

tests/test_ujson.py Show resolved Hide resolved
Comment on lines 564 to 565
for (i = 0; i < enc->indent; i++)
Buffer_AppendCharUnchecked(enc, enc->indent_chars[i]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance, this should probably be replaced with a memcpy, though an extra check in debug mode that this doesn't overrun the buffer would be a good idea. Something like this, I think:

Suggested change
for (i = 0; i < enc->indent; i++)
Buffer_AppendCharUnchecked(enc, enc->indent_chars[i]);
#ifdef DEBUG
if (enc->end - enc->offset < enc->indent)
{
fprintf(stderr, "Ran out of buffer space during Buffer_AppendIndentUnchecked()\n");
abort();
}
#endif
memcpy(enc->offset, enc->indent_chars, enc->indent);
enc->offset += enc->indent;

(Come to think of it, such checks would be a good idea on the other memcpys as well, but that's for another PR.)

A test for this that's basically a merger between test_dump_huge_indent and test_dump_long_string would be good as well. Similar variation, arranged such that it should hit a memory boundary. It's too late right now for me to think this through fully though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot that of course enc->offset needs to be incremented by enc->indent after the copy as well. Fixed that now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#529 adds a Buffer_memcpy function that could be used instead once merged.

@Erotemic
Copy link
Contributor Author

Issues I'm trying to resolve:

indentChars can now contain NULL chars and be considered valid. Had to change a PyUnicode_FromString to PyUnicode_FromStringAndSize and keep track of the size. I haven't quite gotten that right yet.

Having an issue with:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udfff' in position 2: surrogates not allowed

I'm still working on this, albeit sparsely, If someone knows how to rework this to address corner cases, please go for it. I probably wont look at this again for at least a week.

@JustAnotherArchivist
Copy link
Collaborator

I commented on this before, but it's in a resolved comment, so repeating (and expanding) for visibility...

I'm not sure the surrogates or other weird characters in indent are anything to worry about. It would be nice to be completely generic, and the built-in json does support it, which is why I suggested it in the first place. But in practice, I can't think of any legitimate reason to do so. It obviously creates extra headaches and would anyway just produce invalid output as indentation can only consist of tabs and spaces (since those are the two characters, alongside line feeds and carriage returns, which are permissible as optional whitespace between two tokens in JSON).

On the other hand, there are issues with surrogates in values as well (#156, #447), so we need something that can carry those characters from PyUnicode to *char anyway. And if done right, it'd be trivial to reuse that and support this kind of JSON abuse for free.

@Erotemic Erotemic force-pushed the nonint_indent branch 3 times, most recently from 6d8ecf1 to 6861525 Compare April 18, 2022 01:24
@JustAnotherArchivist
Copy link
Collaborator

The tests are failing now due to the negative indentLength, I think. The buffer size calculations all use that variable, and when it's negative, it won't correctly reserve buffer space for the newline.

@Erotemic
Copy link
Contributor Author

That makes sense. I see the issue now. As a quick fix, I defined a max function and used max(indentLenght, 0) in those calculations, but it might make more sense to refactor this such that indentLength is always non-negative and usable in those calculations by adding an additional variable to keep track of if any indent was specified (in which case we add newlines) or if the indent was defaulted (in which case we don't). This would avoid any jumps incurred by the max branches and potentially be faster.

@Erotemic
Copy link
Contributor Author

Did a bit more work on this. I took out all of the max computations and instead ensured that indentLength is always non-negative. To do this I added a new indentEnabled flag to distinguish the case where the indentLength=0, but we still want to behave as though we are indenting (i.e. add newlines).

In a previous commit, there was a behavior change where extra spaces were added after separator chars, and this is now reverted (although that actually matches Python's output better, but it does waste space, so not sure if that's better or not).

I've cleaned up the script I've been using to check compatibility between ujson and json. It now reports that out of 6528 parameter combinations, there are no functional differences, and there are 224 superficial difference which are due to the aforementioned lack of whitespace ujson chooses not to put between separating characters. If we are going for byte-for-byte compatibility we may want to consider that, but if we don't care, then this PR is already more than good enough for my needs.

I think there is still an outstanding issue where I'm forgetting a Py_XDECREF somewhere, the possible usage of Buffer_memcpy from #529, and that this PR depends on surrogate fixes in #530, but I think this is close to ready.

Erotemic and others added 10 commits June 1, 2022 11:37
[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Fix compiler warnings

Update python/objToJSON.c

Co-authored-by: JustAnotherArchivist <JustAnotherArchivist@users.noreply.github.com>

camelCase

whitespace

Update tests/test_ujson.py

Co-authored-by: JustAnotherArchivist <JustAnotherArchivist@users.noreply.github.com>
@JustAnotherArchivist
Copy link
Collaborator

@Erotemic Are you still interested in working on this?

@Erotemic
Copy link
Contributor Author

I would still very much like the feature, but if anyone else wanted to continue (or entirely replace) my work I wouldn't mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow indent to be a str or None for compatibility with Python's json
5 participants