ByteSourceJsonBootstrapper uses CharBufferReader for byte[] inputs #1079

schlosna · 2023-08-10T04:27:33Z

Initial draft to address #593 (and #995 (comment)) when deserializing from a byte[]: the InputStreamReader code path triggers an 8KiB HeapByteBuffer allocation for StreamDecoder regardless of the input byte array's length. This allocation significantly penalizes smaller byte[] sources.

I would appreciate thoughts on the approach here: implementing a Reader on top of a CharBuffer that is decoded by wrapping the source byte[]. This should avoid the unnecessary 8KiB heap byte buffer allocation and leverage OpenJDK's continued charset decoding improvements (e.g. https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html).
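For readers following along, here is a minimal sketch of the idea described above: decode the byte[] once into a CharBuffer, then serve reads from it. This is not the CharBufferReader from this PR; the class name `SketchCharBufferReader` and the helper `decodeAll` are hypothetical, UTF-8 is assumed, and mark/reset and skip are left at the `Reader` defaults.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the class proposed in this PR.
final class SketchCharBufferReader extends Reader {
    private final CharBuffer chars;

    SketchCharBufferReader(CharBuffer chars) {
        this.chars = chars;
    }

    @Override
    public int read(char[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!chars.hasRemaining()) {
            return -1; // end of input
        }
        int n = Math.min(len, chars.remaining());
        chars.get(dst, off, n); // bulk copy out of the decoded buffer
        return n;
    }

    @Override
    public void close() {
        // Mark the buffer as exhausted so later reads return -1.
        chars.position(chars.limit());
    }

    // Decodes src[offset, offset + length) as UTF-8 and drains it through the reader.
    static String decodeAll(byte[] src, int offset, int length) {
        CharBuffer decoded = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(src, offset, length));
        StringBuilder sb = new StringBuilder();
        try (Reader r = new SketchCharBufferReader(decoded)) {
            char[] buf = new char[4];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] json = "{\"a\":1}".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAll(json, 0, json.length));
    }
}
```

The point of the sketch is that no InputStreamReader (and so no StreamDecoder buffer) is involved; the only allocation is the char[] behind the decoded CharBuffer.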

Initial benchmarks from FasterXML/jackson-benchmarks#9 show CharBufferReader providing performance equivalent to ByteArrayInputStream source in worst case, and anywhere from ~2x to ~10x speedup in best case.

@carterkozak for thoughts as well.

# JMH version: 1.27
# VM version: JDK 20.0.2, OpenJDK 64-Bit Server VM, 20.0.2+9-FR
# 2021 Apple M1 Pro (aarch64)
Benchmark                                       (mode)                   (shape)        (type)  Mode  Cnt   Score   Error  Units
JsonArbitraryFieldNameBenchmark.parse          DEFAULT                   KEY_MAP  INPUT_STREAM  avgt    4   0.107 ± 0.001  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT                   KEY_MAP        READER  avgt    4   0.509 ± 0.030  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT                   KEY_MAP   CHAR_READER  avgt    4   0.127 ± 0.003  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  10.175 ± 0.354  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP        READER  avgt    4   1.835 ± 0.040  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP   CHAR_READER  avgt    4   1.539 ± 0.105  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  10.474 ± 0.178  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   2.346 ± 0.066  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP   CHAR_READER  avgt    4   1.894 ± 0.061  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN                   KEY_MAP  INPUT_STREAM  avgt    4   0.107 ± 0.003  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN                   KEY_MAP        READER  avgt    4   0.523 ± 0.044  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN                   KEY_MAP   CHAR_READER  avgt    4   0.127 ± 0.004  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   8.891 ± 0.142  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP        READER  avgt    4   1.284 ± 0.027  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP   CHAR_READER  avgt    4   0.947 ± 0.022  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   9.178 ± 0.113  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   1.742 ± 0.110  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP   CHAR_READER  avgt    4   1.365 ± 0.022  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE                   KEY_MAP  INPUT_STREAM  avgt    4   0.533 ± 0.027  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE                   KEY_MAP        READER  avgt    4   0.484 ± 0.010  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE                   KEY_MAP   CHAR_READER  avgt    4   0.130 ± 0.001  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   0.535 ± 0.037  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP        READER  avgt    4   0.502 ± 0.024  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP   CHAR_READER  avgt    4   0.163 ± 0.001  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   0.842 ± 0.057  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   0.741 ± 0.032  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP   CHAR_READER  avgt    4   0.416 ± 0.008  us/op

pjfanning · 2023-08-10T08:27:07Z

src/main/java/com/fasterxml/jackson/core/io/CharBufferReader.java

+        if (n < 0L) {
+            throw new IllegalArgumentException("number of characters to skip cannot be negative");
+        }
+        int skipped = Math.min((int) n, this.charBuffer.remaining());


please don't cast long to int - there is an edge case where n is too big and the cast value could be nonsense

will change this to:

Suggested change

-        int skipped = Math.min((int) n, this.charBuffer.remaining());
+        int skipped = Math.min(Math.toIntExact(n), this.charBuffer.remaining());

My suggestion works for values of n greater than max-int. Your suggestion will raise an exception. If the API defines n to be long, I think we should respect the API and handle large longs.

Apologies - I had planned to suggest a change but then didn't. The suggestion is

Math.min((int) Math.min(n, Integer.MAX_VALUE), this.charBuffer.remaining());

It is a funny set of APIs: CharBuffer.remaining returns an int, which suggests you can't have more than max-int chars. So maybe your suggestion works fine. The skip API accepts longs, but it appears that is not something we should worry about.

Yeah, the CharBuffer is wrapping a char[] so is practically limited to Integer.MAX_VALUE - 8
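To make the three variants from this thread concrete, here is a small standalone sketch. The class and method names (`SkipClampSketch`, `castOnly`, `clamped`) are made up for illustration, and `remaining` stands in for `charBuffer.remaining()`.

```java
// Hypothetical helper names; 'remaining' stands in for charBuffer.remaining().
final class SkipClampSketch {

    // Original diff: a plain narrowing cast, which wraps for n > Integer.MAX_VALUE.
    static int castOnly(long n, int remaining) {
        return Math.min((int) n, remaining);
    }

    // Clamp in long space first, then narrow; safe for any non-negative n.
    static int clamped(long n, int remaining) {
        return Math.min((int) Math.min(n, Integer.MAX_VALUE), remaining);
    }

    public static void main(String[] args) {
        long tooBig = Integer.MAX_VALUE + 1L;
        System.out.println(castOnly(tooBig, 10)); // negative: the cast wrapped around
        System.out.println(clamped(tooBig, 10));  // 10
        try {
            Math.toIntExact(tooBig);
        } catch (ArithmeticException e) {
            System.out.println("toIntExact rejects values outside the int range");
        }
    }
}
```

With the plain cast, a value just past Integer.MAX_VALUE becomes negative and the computed skip count is negative as well; Math.toIntExact throws ArithmeticException instead, and the clamped form simply skips whatever remains.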

pjfanning · 2023-08-10T08:28:24Z

src/main/java/com/fasterxml/jackson/core/io/CharBufferReader.java

+
+    @Override
+    public void mark(int readAheadLimit) {
+        this.charBuffer.mark();


this ignores the input which seems wrong - if we can't work out the right impl, then you should change the markSupported to return false

pjfanning · 2023-08-10T08:28:49Z

src/main/java/com/fasterxml/jackson/core/io/CharBufferReader.java

+    public void close() {
+        this.charBuffer.position(this.charBuffer.limit());
+    }
+}


can you add a new line at end of file?

pjfanning · 2023-08-10T08:30:44Z

src/test/java/com/fasterxml/jackson/core/io/CharBufferReaderTest.java

+            assertArrayEquals("\0\0".toCharArray(), chars);
+        }
+    }
+}


files should end with new lines

pjfanning · 2023-08-10T10:05:26Z

src/main/java/com/fasterxml/jackson/core/io/CharBufferReader.java

+import java.io.Reader;
+import java.nio.CharBuffer;
+
+public class CharBufferReader extends Reader {


Don't do anything yet but since CharBufferReader is only used in one place - I think it would be a good idea to move it to the same package as ByteSourceJsonBootstrapper and to make it package private and final to discourage its use by other users. We should at least add javadoc to the class saying that this class is only for internal use of jackson-core.

Will do, wanted to figure out where all this might be needed then encapsulate.

pjfanning · 2023-08-10T10:07:52Z

src/main/java/com/fasterxml/jackson/core/io/CharBufferReader.java

+    private final CharBuffer charBuffer;
+
+    public CharBufferReader(CharBuffer buffer) {
+        this.charBuffer = buffer.duplicate();


Is it necessary to duplicate the char buffer? If we strongly discourage the use of this class and we only have 1 usage of this class - that usage will not modify the input buffer. If we know the input buffer won't be modified, then theoretically, we don't need to duplicate it.

Will refactor this when encapsulating, was being defensive.

Address PR comments

schlosna

Thanks for the review @pjfanning ! I have updated to address some of your comments.

schlosna · 2023-08-10T17:08:01Z

src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java

+                    int size = _inputEnd - _inputPtr;
+                    if (size >= 0 && size <= 8192) {
+                        // [jackson-core#488] Avoid overhead of heap ByteBuffer allocated by InputStreamReader
+                        // when processing small inputs up to 8KiB.
+                        Charset charset = Charset.forName(enc.getJavaName());
+                        return new CharBufferReader(charset.decode(ByteBuffer.wrap(_inputBuffer, _inputPtr, _inputEnd)));
+                    }
                    in = new ByteArrayInputStream(_inputBuffer, _inputPtr, _inputEnd);


Open to thoughts here on threshold.

This new path decoding from ByteBuffer to CharBuffer will allocate a char[] based on the encoding Charset, so I chose 8192 as that would be the break even point for InputStreamReader's StreamDecoder's 8192 byte heap ByteBuffer.
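One detail worth checking on this path: `ByteBuffer.wrap(byte[], int, int)` takes an offset and a length, not an offset and an end index. The diff above passes `_inputEnd` as the third argument, which matches the length only when `_inputPtr` is 0. A minimal sketch of decoding a [start, end) slice, assuming UTF-8 and using the hypothetical names `WrapSliceSketch` and `decodeSlice`:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of decoding buf[start, end) with a Charset.
final class WrapSliceSketch {

    static String decodeSlice(byte[] buf, int start, int end) {
        int length = end - start; // wrap() expects a length, not an end index
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(buf, start, length));
        return chars.toString();
    }

    public static void main(String[] args) {
        byte[] buf = "xx[1,2]yy".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeSlice(buf, 2, 7)); // the JSON array between the padding
    }
}
```

Passing the end index where the length is expected either reads past the intended slice or throws IndexOutOfBoundsException when offset plus length exceeds the array.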

carterkozak · 2023-08-10T18:15:15Z

src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java

+                        // [jackson-core#488] Avoid overhead of heap ByteBuffer allocated by InputStreamReader
+                        // when processing small inputs up to 8KiB.
+                        Charset charset = Charset.forName(enc.getJavaName());
+                        return new CharBufferReader(charset.decode(ByteBuffer.wrap(_inputBuffer, _inputPtr, _inputEnd)));


I wonder how this compares with constructing a string:

return new StringReader(new String(_inputBuffer, _inputPtr, _inputEnd - _inputPtr, charset));

The string methods tend to perform very well due to extensive optimization in the jdk, however when we use a reader, the char[] used by CharBufferReader may perform better since it's not converting between bytes and chars on read.

Good call, will benchmark and test this out. For Latin-1/ASCII compressed strings, we'll eat the cost of a byte array copy on string creation and decode should be no-op, while non-Latin-1 case may be a bit more expensive.
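A minimal, self-contained sketch of the StringReader alternative suggested above, for comparison with the CharBuffer approach. The names `StringReaderSketch`, `readerFor`, and `drain` are hypothetical, and the fields from the diff are replaced by plain parameters.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; parameters stand in for _inputBuffer, _inputPtr, _inputEnd.
final class StringReaderSketch {

    // Decode buf[start, end) into a String once, then read from it.
    static Reader readerFor(byte[] buf, int start, int end, Charset charset) {
        return new StringReader(new String(buf, start, end - start, charset));
    }

    // Reads everything from the reader into a String.
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(chunk, 0, chunk.length)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] buf = "{\"k\":\"\u00e9\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(drain(readerFor(buf, 0, buf.length, StandardCharsets.UTF_8)));
    }
}
```

This keeps the decoding entirely inside the JDK's String constructor, so no custom Reader class is needed.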

The JsonEncoding class only supports Unicode charsets: UTF-8, variants of UTF-16, and variants of UTF-32. 7-bit ASCII is a subset of UTF-8. Latin-1 is not supported.

FasterXML/jackson-benchmarks#9 (comment) shows new String to StringReader is better, so I created #1081 as an alternative to this PR.

I should have been more precise. The Latin-1 mention was meant to refer to the single-byte encoding used by compact strings (JEP 254), but I should have said that non-ASCII UTF-8 will be a bit more expensive in terms of memory overhead.

schlosna · 2023-08-11T02:10:03Z

I'm going to close this out in favor of the simpler and more efficient #1081.

schlosna added 2 commits August 9, 2023 23:54

Add CharBufferReader

4e7433d

ByteSourceJsonBootstrapper uses CharBufferReader for byte[] inputs

df09943

schlosna mentioned this pull request Aug 10, 2023

Add StringReader input benchmark FasterXML/jackson-benchmarks#9

Merged

pjfanning reviewed Aug 10, 2023

Encapsulate CharBufferReader implementation

f63bcd5

Address PR comments

schlosna commented Aug 10, 2023

carterkozak reviewed Aug 10, 2023

schlosna mentioned this pull request Aug 10, 2023

Make ByteSourceJsonBootstrapper use StringReader for < 8KiB byte[] inputs #1081

Merged

schlosna closed this Aug 11, 2023

schlosna deleted the ds/CharBufferReader branch August 11, 2023 02:10
