Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteSourceJsonBootstrapper uses CharBufferReader for byte[] inputs #1079

Closed
wants to merge 3 commits into from

Conversation

schlosna
Copy link
Contributor

Initial draft to address #593 (and #995 (comment) ) when deserializing from a byte[] as the InputStreamReader code path triggers an 8KiB HeapByteBuffer allocation for StreamDecoder regardless of input byte array length. This allocation significantly penalizes smaller byte[] sources.

image

I would appreciate thoughts on the approach here implementing a Reader via a CharBuffer that is decoded via wrapping the source byte[]. This should avoid unnecessary 8KiB heap byte buffer allocation and leverage OpenJDK's continued charset decoding improvements (e.g. https://cl4es.github.io/2021/02/23/Faster-Charset-Decoding.html ).

Initial benchmarks from FasterXML/jackson-benchmarks#9 show CharBufferReader providing performance equivalent to ByteArrayInputStream source in worst case, and anywhere from ~2x to ~10x speedup in best case.

@carterkozak for thoughts as well.

# JMH version: 1.27
# VM version: JDK 20.0.2, OpenJDK 64-Bit Server VM, 20.0.2+9-FR
# 2021 Apple M1 Pro (aarch64)
Benchmark                                       (mode)                   (shape)        (type)  Mode  Cnt   Score   Error  Units
JsonArbitraryFieldNameBenchmark.parse          DEFAULT                   KEY_MAP  INPUT_STREAM  avgt    4   0.107 ± 0.001  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT                   KEY_MAP        READER  avgt    4   0.509 ± 0.030  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT                   KEY_MAP   CHAR_READER  avgt    4   0.127 ± 0.003  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  10.175 ± 0.354  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP        READER  avgt    4   1.835 ± 0.040  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT            RANDOM_KEY_MAP   CHAR_READER  avgt    4   1.539 ± 0.105  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4  10.474 ± 0.178  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   2.346 ± 0.066  us/op
JsonArbitraryFieldNameBenchmark.parse          DEFAULT  BEAN_WITH_RANDOM_KEY_MAP   CHAR_READER  avgt    4   1.894 ± 0.061  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN                   KEY_MAP  INPUT_STREAM  avgt    4   0.107 ± 0.003  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN                   KEY_MAP        READER  avgt    4   0.523 ± 0.044  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN                   KEY_MAP   CHAR_READER  avgt    4   0.127 ± 0.004  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   8.891 ± 0.142  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP        READER  avgt    4   1.284 ± 0.027  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN            RANDOM_KEY_MAP   CHAR_READER  avgt    4   0.947 ± 0.022  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   9.178 ± 0.113  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   1.742 ± 0.110  us/op
JsonArbitraryFieldNameBenchmark.parse        NO_INTERN  BEAN_WITH_RANDOM_KEY_MAP   CHAR_READER  avgt    4   1.365 ± 0.022  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE                   KEY_MAP  INPUT_STREAM  avgt    4   0.533 ± 0.027  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE                   KEY_MAP        READER  avgt    4   0.484 ± 0.010  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE                   KEY_MAP   CHAR_READER  avgt    4   0.130 ± 0.001  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   0.535 ± 0.037  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP        READER  avgt    4   0.502 ± 0.024  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE            RANDOM_KEY_MAP   CHAR_READER  avgt    4   0.163 ± 0.001  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP  INPUT_STREAM  avgt    4   0.842 ± 0.057  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP        READER  avgt    4   0.741 ± 0.032  us/op
JsonArbitraryFieldNameBenchmark.parse  NO_CANONICALIZE  BEAN_WITH_RANDOM_KEY_MAP   CHAR_READER  avgt    4   0.416 ± 0.008  us/op

if (n < 0L) {
throw new IllegalArgumentException("number of characters to skip cannot be negative");
}
int skipped = Math.min((int) n, this.charBuffer.remaining());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't cast long to int - there is an edge case where n is too big and the cast value could be nonsense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will change this to:

Suggested change
int skipped = Math.min((int) n, this.charBuffer.remaining());
int skipped = Math.min(Math.toIntExact(n), this.charBuffer.remaining());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion works for values of n greater than max-int. Your suggestion will raise an exception. If the API defines n to be long, I think we should respect the API and handle large longs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies - I had planned to suggest a change but then didn't. The suggestion is

 Math.min((int) Math.min(n, Integer.MAX_VALUE), this.charBuffer.remaining());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a funny set of APIs, CharBuffer.remaining returns an int so that suggests you can't have more the max-int chars. So maybe, your suggestion works fine. The skip API accepts longs but it appears that it not something we should worry about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the CharBuffer is wrapping a char[] so is practically limited to Integer.MAX_VALUE - 8


@Override
public void mark(int readAheadLimit) {
this.charBuffer.mark();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ignores the input which seems wrong - if we can't work out the right impl, then you should change the markSupported to return false

public void close() {
this.charBuffer.position(this.charBuffer.limit());
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a new line at end of file?

assertArrayEquals("\0\0".toCharArray(), chars);
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

files should end with new lines

import java.io.Reader;
import java.nio.CharBuffer;

public class CharBufferReader extends Reader {
Copy link
Member

@pjfanning pjfanning Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't do anything yet but since CharBufferReader is only used in one place - I think it would be a good idea to move it to the same package as ByteSourceJsonBootstrapper and to make it package private and final to discourage its use by other users. We should at least add javadoc to the class saying that this class is only for internal use of jackson-core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, wanted to figure out where all this might be needed then encapsulate.

private final CharBuffer charBuffer;

public CharBufferReader(CharBuffer buffer) {
this.charBuffer = buffer.duplicate();
Copy link
Member

@pjfanning pjfanning Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to duplicate the char buffer? If we strongly discourage the use of this class and we only have 1 usage of this class - that usage will not modify the input buffer. If we know the input buffer won't be modified, then theoretically, we don't need to duplicate it.

Copy link
Contributor Author

@schlosna schlosna Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will refactor this when encapsulating, was being defensive.

Copy link
Contributor Author

@schlosna schlosna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @pjfanning ! I have updated to address some of your comments.

Comment on lines +235 to 242
int size = _inputEnd - _inputPtr;
if (size >= 0 && size <= 8192) {
// [jackson-core#488] Avoid overhead of heap ByteBuffer allocated by InputStreamReader
// when processing small inputs up to 8KiB.
Charset charset = Charset.forName(enc.getJavaName());
return new CharBufferReader(charset.decode(ByteBuffer.wrap(_inputBuffer, _inputPtr, _inputEnd)));
}
in = new ByteArrayInputStream(_inputBuffer, _inputPtr, _inputEnd);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open to thoughts here on threshold.

This new path decoding from ByteBuffer to CharBuffer will allocate a char[] based on the encoding Charset, so I chose 8192 as that would be the break even point for InputStreamReader's StreamDecoder's 8192 byte heap ByteBuffer.

// [jackson-core#488] Avoid overhead of heap ByteBuffer allocated by InputStreamReader
// when processing small inputs up to 8KiB.
Charset charset = Charset.forName(enc.getJavaName());
return new CharBufferReader(charset.decode(ByteBuffer.wrap(_inputBuffer, _inputPtr, _inputEnd)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how this compares with constructing a string:

return new StringReader(new String(_inputBuffer, _inputPtr, _inputEnd - _inputPtr, charset));

The string methods tend to perform very well due to extensive optimization in the jdk, however when we use a reader, the char[] used by CharBufferReader may perform better since it's not converting between bytes and chars on read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will benchmark and test this out. For Latin-1/ASCII compressed strings, we'll eat the cost of a byte array copy on string creation and decode should be no-op, while non-Latin-1 case may be a bit more expensive.

Copy link
Member

@pjfanning pjfanning Aug 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JsonEncoding class only supports Unicode charsets. UTF8, variants of UTF16, variants of UTF32. 7 bit ASCII is a subset of UTF8. Latin-1 is not supported.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FasterXML/jackson-benchmarks#9 (comment) shows new String to StringReader is better, so I created #1081 as an alternative to this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JsonEncoding class only supports Unicode charsets. UTF8, variants of UTF16, variants of UTF32. 7 bit ASCII is a sunset of UTF8. Latin-1 is not supported.

I should have been more precise. The Latin-1 was intended to reference the single byte encoding compact strings JEP 254, but I should have said non-ASCII UTF-8 will be a bit more expensive in terms of memory overhead.

@schlosna
Copy link
Contributor Author

I'm going to close this out in favor of the simpler, and more efficient #1081

@schlosna schlosna closed this Aug 11, 2023
@schlosna schlosna deleted the ds/CharBufferReader branch August 11, 2023 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants