Allow str and None values for indent #518

Erotemic · 2022-04-05T00:06:16Z

Fixes #517

Changes proposed in this pull request:

Allows indent to be specified as None, an integer, or a string, as long as the string contains only spaces. This increases API compatibility between ujson and Python's json, making it easier to use ujson as a drop-in replacement.

As this is just one check at the start of the call, I don't expect that this would have any serious performance hit.

This also includes corresponding tests to ensure the str and integer way of calling ujson are equivalent.
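
For reference, the equivalence these tests aim for can be seen in the standard library's json, where an integer indent is shorthand for that many spaces. The sketch below uses only the stdlib (ujson is not called here, and `data` is just an illustrative value):

```python
import json

data = {"a": [1, 2], "b": None}

# In the stdlib, an integer indent is shorthand for that many spaces,
# so these two calls produce the same text.
as_int = json.dumps(data, indent=2)
as_str = json.dumps(data, indent="  ")

# indent=None is the default and keeps everything on one line.
compact = json.dumps(data, indent=None)
```

A drop-in replacement would be expected to give the same pairwise agreement for its own output.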

codecov-commenter · 2022-04-05T01:05:55Z

Codecov Report

Merging #518 (7859299) into main (f6860f1) will increase coverage by 0.10%.
The diff coverage is 94.73%.

❗ Current head 7859299 differs from pull request most recent head b7c4134. Consider uploading reports for the commit b7c4134 to get more accurate results

@@            Coverage Diff             @@
##             main     #518      +/-   ##
==========================================
+ Coverage   90.63%   90.73%   +0.10%     
==========================================
  Files           6        6              
  Lines        1783     1835      +52     
==========================================
+ Hits         1616     1665      +49     
- Misses        167      170       +3

Impacted Files	Coverage Δ
python/objToJSON.c	`87.09% <88.88%> (+0.05%)`	⬆️
lib/ultrajsonenc.c	`85.12% <100.00%> (+0.15%)`	⬆️
tests/test_ujson.py	`99.60% <100.00%> (+0.01%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Erotemic · 2022-04-05T02:01:11Z

@JustAnotherArchivist The latest commit removes the ValueError on an indent that is not purely spaces, and allows a completely custom UTF-8 indent for full compatibility with Python's json.

I've verified that it works, but I am not a C expert, so please double check and critique my logic and implementation of this feature. You mentioned that it requires proper memory management in the corresponding issue, but IIUC, PyUnicode_AsUTF8AndSize and PyBytes_AsString do not require the user to deallocate the buffer, so I'm not sure what other memory management is needed.

bwoodsend · 2022-04-05T20:13:23Z

Mind rebasing this so as to include #519 (I imagine that there will be merge conflicts)? Then we can dig around for potential memory issues.

hugovk · 2022-04-05T20:30:52Z

@bwoodsend I just turned this setting on in the repo:

Which gives us these buttons:

And it says there are no conflicts. So if you like, you can rebase it right here :)

JustAnotherArchivist

To clarify on the memory management, I wasn't talking about the Python layer but the buffer overflow security issue that was just fixed in #519. The way it's implemented here, I think it shouldn't reintroduce any security issues. The buffer reservations use enc->indent for the size calculation, and that's set correctly for the string case, I think. (The optional newline is already always accounted for.) So that seems fine.

Some further comments below.

lib/ultrajson.h

lib/ultrajsonenc.c

JustAnotherArchivist · 2022-04-06T00:09:17Z

python/objToJSON.c

+  }
+
+  *_outLen = PyBytes_Size(newObj);
+  return PyBytes_AsString(newObj);


As you mentioned, PyUnicode_AsUTF8AndSize and PyBytes_AsString don't require dereferencing. But as far as I can tell, PyUnicode_AsUTF8String does. So there'd need to be a Py_DECREF(newObj) call between PyBytes_AsString and returning, I think.

I'll look more into that. Also, with regard to this code, is there a better way of implementing the _PyUnicodeToChars logic? I took this logic from elsewhere in the file, where it does a similar but not identical thing. I'm curious what best practices are here.

It seems like this is indeed the way to do it. Looks like the other similar code keeps a reference to newObj and then later calls Py_XDECREF on it, by the way.

Also, sidenote: PyUnicode_AsUTF8AndSize is part of the stable ABI since Python 3.10, so once that's the minimum supported version, only the code in the #ifndef will be needed.

I just looked at this again as I'm trying to fix the issues with surrogates. PyBytes_AsString returns a pointer to the internal char* buffer. This means the PyObject can't be DECREFd until the buffer is no longer needed. This is why the similar existing function stores the newObj in the context struct.

It should be possible to refactor PyUnicodeToUTF8 slightly to make it reusable for this (and it'll be useful for other things as well, e.g. separators). Then the PyObject would be kept and DECREFd at the end of objToJSON. I'll incorporate that refactor in my surrogate fix since it requires changes to that function anyway.

The fix for the surrogates unblocked me in terms of what I could do in this PR. Using PyUnicode_AsEncodedString was the key. I rebased on your branch so I could get those changes, and the tests are now at least working locally. Let's see how they do on the CI. This PR probably needs a bit of cleanup because I did a bunch of print debugging.

You can use my PyUnicodeToUTF8Raw, no need for the duplicate function. You will need to take care of DECREFing the PyBytes object at the end though as mentioned above (also in case of objToJSON returning early!).

python/objToJSON.c

JustAnotherArchivist · 2022-04-06T00:14:48Z

tests/test_ujson.py

+    output0c = ujson.encode(data, indent="")
+    output0d = ujson.encode(data, indent=0)


These (and also negative ints) should actually produce a different output than the default and indent=None, namely inserting newlines but no indentation. However, that bug has been present for longer (#317), and if you don't want to tackle it, it could be done at a later time.
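
The behaviour being described is what the standard library's json does, as this short stdlib-only sketch shows (the variable names are illustrative, and ujson itself is not called):

```python
import json

data = [1, 2]

default = json.dumps(data)              # no indent: single line
zero = json.dumps(data, indent=0)       # newlines, no indentation
empty = json.dumps(data, indent="")     # same as indent=0
negative = json.dumps(data, indent=-1)  # negative ints behave like 0
```

Here `default` is `[1, 2]`, while the other three all come out as `[`, `1,`, `2`, `]` on separate lines.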

tests/test_ujson.py

JustAnotherArchivist · 2022-04-06T04:54:45Z

lib/ultrajsonenc.c

+          for (i = 0; i < enc->indent; i++)
+            Buffer_AppendCharUnchecked(enc, enc->indent_chars[i]);


For performance, this should probably be replaced with a memcpy, though an extra check in debug mode that this doesn't overrun the buffer would be a good idea. Something like this, I think:

Suggested change

-          for (i = 0; i < enc->indent; i++)
-            Buffer_AppendCharUnchecked(enc, enc->indent_chars[i]);
+#ifdef DEBUG
+          if (enc->end - enc->offset < enc->indent)
+          {
+            fprintf(stderr, "Ran out of buffer space during Buffer_AppendIndentUnchecked()\n");
+            abort();
+          }
+#endif
+          memcpy(enc->offset, enc->indent_chars, enc->indent);
+          enc->offset += enc->indent;

(Come to think of it, such checks would be a good idea on the other memcpys as well, but that's for another PR.)

A test for this that's basically a merger between test_dump_huge_indent and test_dump_long_string would be good as well. Similar variation, arranged such that it should hit a memory boundary. It's too late right now for me to think this through fully though.

I forgot that of course enc->offset needs to be incremented by enc->indent after the copy as well. Fixed that now.

#529 adds a Buffer_memcpy function that could be used instead once merged.

Erotemic · 2022-04-11T00:08:36Z

Issues I'm trying to resolve:

indentChars can now contain NULL chars and be considered valid. Had to change a PyUnicode_FromString to PyUnicode_FromStringAndSize and keep track of the size. I haven't quite gotten that right yet.

Having an issue with:
UnicodeEncodeError: 'utf-8' codec can't encode character '\udfff' in position 2: surrogates not allowed

I'm still working on this, albeit sporadically. If someone knows how to rework this to address the corner cases, please go for it. I probably won't look at this again for at least a week.
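
The error above can be reproduced at the Python level without ujson. The sketch below also shows the `surrogatepass` error handler, which is one way Python can carry a lone surrogate through UTF-8. Whether this PR ends up using that handler is an assumption here, not something stated in this thread.

```python
lone = "\udfff"  # a lone low surrogate, as in the error message above

# The strict UTF-8 codec refuses lone surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" encodes the surrogate code point as a 3-byte sequence.
passed = lone.encode("utf-8", "surrogatepass")
roundtrip = passed.decode("utf-8", "surrogatepass")

# An indent may also contain NUL, so its length has to be tracked
# explicitly rather than found by scanning for a terminator.
nul_indent = "\x00 ".encode("utf-8")
nul_length = len(nul_indent)
```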

JustAnotherArchivist · 2022-04-11T00:53:27Z

I commented on this before, but it's in a resolved comment, so repeating (and expanding) for visibility...

I'm not sure the surrogates or other weird characters in indent are anything to worry about. It would be nice to be completely generic, and the built-in json does support it, which is why I suggested it in the first place. But in practice, I can't think of any legitimate reason to do so. It obviously creates extra headaches and would anyway just produce invalid output as indentation can only consist of tabs and spaces (since those are the two characters, alongside line feeds and carriage returns, which are permissible as optional whitespace between two tokens in JSON).

On the other hand, there are issues with surrogates in values as well (#156, #447), so we need something that can carry those characters from PyUnicode to *char anyway. And if done right, it'd be trivial to reuse that and support this kind of JSON abuse for free.
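
To illustrate the point about valid indentation, here is a stdlib-only sketch: a tab indent round-trips, while an indent made of other characters produces text that a JSON parser rejects. The `data` value and variable names are illustrative.

```python
import json

data = {"a": 1}

tabbed = json.dumps(data, indent="\t")  # tabs are legal JSON whitespace
dashed = json.dumps(data, indent="--")  # dashes are not

tabbed_roundtrip = json.loads(tabbed)

try:
    json.loads(dashed)
    dashed_is_valid = True
except json.JSONDecodeError:
    dashed_is_valid = False
```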

JustAnotherArchivist · 2022-04-18T02:29:20Z

The tests are failing now due to the negative indentLength, I think. The buffer size calculations all use that variable, and when it's negative, it won't correctly reserve buffer space for the newline.

Erotemic · 2022-04-18T16:33:34Z

That makes sense. I see the issue now. As a quick fix, I defined a max function and used max(indentLength, 0) in those calculations. It might make more sense to refactor this so that indentLength is always non-negative and usable in those calculations, with an additional variable that tracks whether an indent was specified (in which case we add newlines) or defaulted (in which case we don't). This would avoid any jumps incurred by the max branches and potentially be faster.

Erotemic · 2022-04-20T04:49:00Z

Did a bit more work on this. I took out all of the max computations and instead ensured that indentLength is always non-negative. To do this I added a new indentEnabled flag to distinguish the case where the indentLength=0, but we still want to behave as though we are indenting (i.e. add newlines).
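
The idea can be modelled in a few lines of Python. This is only a sketch of the logic described above, not the C implementation: `IndentConfig`, `parse_indent` and `bytes_to_reserve` are hypothetical names, and the real buffer reservations in the encoder cover more than the line break shown here.

```python
from typing import NamedTuple, Union


class IndentConfig(NamedTuple):
    enabled: bool  # True if the caller passed any indent, even an empty one
    chars: bytes   # UTF-8 bytes written once per nesting level


def parse_indent(indent: Union[None, int, str]) -> IndentConfig:
    """Normalise the indent argument once, at the start of the call."""
    if indent is None:
        return IndentConfig(False, b"")
    if isinstance(indent, int):
        # Clamp here so the length is never negative later on.
        return IndentConfig(True, b" " * max(indent, 0))
    if isinstance(indent, str):
        return IndentConfig(True, indent.encode("utf-8"))
    raise TypeError("indent must be None, an int, or a str")


def bytes_to_reserve(cfg: IndentConfig, depth: int) -> int:
    """Bytes needed for one line break at the given nesting depth."""
    if not cfg.enabled:
        return 0
    return 1 + depth * len(cfg.chars)  # newline plus the indentation
```

With this split, `parse_indent(0)` and `parse_indent(None)` both have an empty `chars`, but only the former reserves space for the newline.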

In a previous commit, there was a behavior change where extra spaces were added after separator chars, and this is now reverted (although that actually matches Python's output better, but it does waste space, so not sure if that's better or not).

I've cleaned up the script I've been using to check compatibility between ujson and json. It now reports that out of 6528 parameter combinations, there are no functional differences and 224 superficial differences, which are due to the aforementioned whitespace that ujson chooses not to put between separating characters. If we are going for byte-for-byte compatibility we may want to consider that, but if we don't care, then this PR is already more than good enough for my needs.

A few items are still outstanding: I think I'm forgetting a Py_XDECREF somewhere, Buffer_memcpy from #529 could be used once it is available, and this PR depends on the surrogate fixes in #530. Otherwise, I think this is close to ready.
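
The compatibility script itself is not shown in this thread. As a rough sketch of how such a check could separate functional from superficial differences, the following compares two encoders over a small parameter grid. `classify` and `tight_dumps` are hypothetical, and `tight_dumps` is only a stdlib stand-in for an encoder that omits spaces after separators, not ujson.

```python
import itertools
import json


def classify(dumps_a, dumps_b, value, **kwargs):
    """Compare two encoders on one input and label the difference."""
    a = dumps_a(value, **kwargs)
    b = dumps_b(value, **kwargs)
    if a == b:
        return "identical"
    if json.loads(a) == json.loads(b):
        return "superficial"  # same data, different whitespace
    return "functional"


def tight_dumps(value, **kwargs):
    # Stand-in encoder: no spaces after "," or ":".
    return json.dumps(value, separators=(",", ":"), **kwargs)


values = [{"a": [1, 2]}, [], "text"]
indents = [None, 0, 2, "\t"]

results = {
    (repr(v), repr(i)): classify(json.dumps, tight_dumps, v, indent=i)
    for v, i in itertools.product(values, indents)
}
```

Counting the labels in `results` gives the kind of summary quoted above, with "superficial" covering outputs that decode to the same data but differ byte for byte.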


JustAnotherArchivist · 2022-12-19T09:31:18Z

@Erotemic Are you still interested in working on this?

Erotemic · 2022-12-19T14:43:40Z

I would still very much like the feature, but if anyone else wanted to continue (or entirely replace) my work I wouldn't mind.

Erotemic force-pushed the nonint_indent branch from ea5e36a to 058b525 Compare April 5, 2022 00:09

Erotemic mentioned this pull request Apr 5, 2022

Allow indent to be a str or None for compatibility with Python's json #517

Open

bwoodsend force-pushed the nonint_indent branch from 9289cb9 to 8c35a44 Compare April 5, 2022 21:03

JustAnotherArchivist reviewed Apr 6, 2022

View reviewed changes

JustAnotherArchivist mentioned this pull request Apr 7, 2022

Accept a separators parameter #283

Closed

Erotemic force-pushed the nonint_indent branch from a1b6a9f to 029e694 Compare April 11, 2022 00:09

Erotemic force-pushed the nonint_indent branch 3 times, most recently from 6d8ecf1 to 6861525 Compare April 18, 2022 01:24

JustAnotherArchivist mentioned this pull request May 19, 2022

Generalize benchmarks #532

Closed

Erotemic and others added 10 commits June 1, 2022 11:37

Allow str and None values for indent

b8db074

Use older PyObject_call API

97fcfff

Debug code

ba4acd8

Differentiate integer vs explicit indent

e448fb3

remove printf

5f6abc6

Use PyUnicode_AsEncodedString

e48bec8

Use PyUnicode_AsEncodedString

b7cb1fb

Enable all agree checks

9accee7

remove flake8 long lines

947e637

pre-commit-ci bot and others added 7 commits June 1, 2022 11:37

[pre-commit.ci] auto fixes from pre-commit.com hooks

6e64c43

for more information, see https://pre-commit.ci

remove compat tests

9191492

Fix negative allocation

d5cf596

remove non portable min/max

0d7cdd1

Remove max in favor of indentEnabled

6bb2f7b

[pre-commit.ci] auto fixes from pre-commit.com hooks

71291ed

for more information, see https://pre-commit.ci

Fix -1 length issue and revert extra spaces in tests

9850ff0

Erotemic force-pushed the nonint_indent branch from 75895fc to 9850ff0 Compare June 1, 2022 15:38
