
Investigate use of Compact strings for TextBuffer #910

Open
yawkat opened this issue Feb 2, 2023 · 9 comments

@yawkat
Member

yawkat commented Feb 2, 2023

I was looking at some heap dumps and saw that TextBuffer still uses char[] internally. I think it could instead use an approach similar to the jdk String and StringBuilder, with a byte[] that either contains latin1 encoded as one byte per char, or utf-16 with two bytes per char. This would save a lot of memory in most standard cases, especially when the buffer becomes large.

It could also improve performance a bit when constructing String instances. Using the String charset constructor, it's possible to create a latin1 String directly from bytes with a single copy, while going through the char[] constructor needs to run compaction. I don't know whether this matters in practice, though; the compaction is probably already very fast.
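A minimal sketch of the idea (hypothetical `CompactTextBuffer`, not the actual jackson-core API): keep a byte[] plus a coder flag, as java.lang.String and StringBuilder do internally, storing one byte per char while the content stays Latin-1 and inflating to two bytes per char only when a wider character arrives.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: Latin-1 storage (one byte per char) until a
// char > 0xFF arrives, then inflate to UTF-16 (two bytes per char).
class CompactTextBuffer {
    private static final byte LATIN1 = 0, UTF16 = 1;
    private byte[] buf = new byte[16];
    private byte coder = LATIN1;
    private int len; // length in chars, not bytes

    void append(char c) {
        if (coder == LATIN1 && c > 0xFF) inflate();
        if (coder == LATIN1) {
            if (len == buf.length) buf = Arrays.copyOf(buf, len * 2);
            buf[len++] = (byte) c;
        } else {
            if (len * 2 == buf.length) buf = Arrays.copyOf(buf, len * 4);
            buf[len * 2] = (byte) (c >> 8); // big-endian pair
            buf[len * 2 + 1] = (byte) c;
            len++;
        }
    }

    // Widen existing Latin-1 content to UTF-16BE in place.
    private void inflate() {
        byte[] wide = new byte[buf.length * 2];
        for (int i = 0; i < len; i++) {
            wide[i * 2] = 0;
            wide[i * 2 + 1] = buf[i];
        }
        buf = wide;
        coder = UTF16;
    }

    String contentsAsString() {
        return coder == LATIN1
                ? new String(buf, 0, len, StandardCharsets.ISO_8859_1)
                : new String(buf, 0, len * 2, StandardCharsets.UTF_16BE);
    }
}
```

For purely ASCII or Latin-1 payloads this halves the buffer footprint, which is the memory saving the heap dumps suggest.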

What do you think @cowtowncoder @pjfanning ?

@pjfanning
Member

Could be useful to prototype and we could perf test it.

@pjfanning
Member

The Reader based JSON parser would probably still be faster with a char array based TextBuffer. Possibly, we need separate byte and char TextBuffer impls.

@cowtowncoder
Member

I would be surprised if that could perform well: wouldn't it need to re-encode UTF-8 in general case? And considering that input to decode (before being fed to TextBuffer) can come from either byte or char sources it'd be particularly challenging for char-sources.

I mean I am not against someone trying things out but my first gut reaction is that this does not make much sense. Happy to be proven wrong of course :)

@yawkat
Member Author

yawkat commented Feb 2, 2023

it would have to re-encode to utf-16 if there is non-latin1 input, yes. this is also what stringbuilder does though, and i think it may be worth it. will depend heavily on the input data ofc.

maybe it would be worth trying to implement TextBuffer with a StringBuilder as the backing store instead of the char[] array. it should be fairly easy, and StringBuilder already has all the compaction machinery. So this approach could be good for testing at least.
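A sketch of that experiment (names hypothetical; the real TextBuffer API has many more methods): a buffer that simply delegates to StringBuilder, which already carries the JDK's compact-string machinery.

```java
// Hypothetical sketch: delegate buffering to StringBuilder, which already
// stores Latin-1 content compactly and inflates to UTF-16 only when needed.
class SBTextBufferSketch {
    private final StringBuilder sb = new StringBuilder();

    void append(char c) { sb.append(c); }

    void append(char[] c, int start, int len) { sb.append(c, start, len); }

    String contentsAsString() { return sb.toString(); }

    int size() { return sb.length(); }

    // Reuse the builder for the next value; note this does NOT reset the
    // internal coder back to Latin-1 on current JDKs.
    void resetWithEmpty() { sb.setLength(0); }
}
```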

@pjfanning
Member

@yawkat StringBuilder approach loses the BufferRecycler support that jackson-core has. Maybe, we would need a StringBuilder recycler.

@cowtowncoder
Member

cowtowncoder commented Feb 2, 2023

> it would have to re-encode to utf-16 if there is non-latin1 input, yes. this is also what stringbuilder does though, and i think it may be worth it. will depend heavily on the input data ofc.

Really? StringBuilder does that? Wouldn't that lead to massive overhead for, say, charAt(...)?
Mostly it's just that since Java started with the modified UTF-16 approach, so much is based on use of char[]... I find it hard to believe things could be made processing-efficient with a hybrid approach.

But back to the original suggestion of reduced memory usage... I don't know. BufferRecycler does hold on to buffers, but otherwise these are transient buffers.
The only really sizable ones are used for the case of mapper.writeValueAsString(), which I hope users would avoid anyway, there being better alternatives.

It is also possible that BufferRecycler itself becomes less useful with non-blocking (or new Loom virtual thread) runtimes.

EDIT: read about "compact strings" added in JDK 9 years ago (which I vaguely remember).
Those are used for Latin-1 (that is, code points 0 - 255) Strings, for which indexing remains efficient.
But I haven't been able to find (without downloading the JDK sources, it's been a while) what was done for StringBuilder/StringBuffer, and whether the hybrid case is supported.
I guess it would have to be, to support all operations (for constants, it can be determined up front whether single-byte codes suffice; but not when dynamically building).

@yawkat
Member Author

yawkat commented Feb 6, 2023

okay i've done some preliminary benchmarking in https://github.com/yawkat/jackson-core/tree/compact-strings . it is very interesting. The benchmark results without StringBuilder:

Benchmark                       (alphabet)  (length)  Mode  Cnt        Score       Error  Units
TextBufferBenchmark.parseBytes       ASCII        10  avgt    4      127.322 ±     3.525  ns/op
TextBufferBenchmark.parseBytes       ASCII      1000  avgt    4     1408.151 ±    35.624  ns/op
TextBufferBenchmark.parseBytes       ASCII   1000000  avgt    4  1530236.122 ± 48515.349  ns/op
TextBufferBenchmark.parseBytes      GERMAN        10  avgt    4      122.730 ±     5.878  ns/op
TextBufferBenchmark.parseBytes      GERMAN      1000  avgt    4     2314.222 ±    35.812  ns/op
TextBufferBenchmark.parseBytes      GERMAN   1000000  avgt    4  3558829.741 ± 82599.881  ns/op

and with StringBuilder:

Benchmark                       (alphabet)  (length)  Mode  Cnt        Score        Error  Units
TextBufferBenchmark.parseBytes       ASCII        10  avgt    4      122.041 ±      3.393  ns/op
TextBufferBenchmark.parseBytes       ASCII      1000  avgt    4      916.802 ±     24.734  ns/op
TextBufferBenchmark.parseBytes       ASCII   1000000  avgt    4  1087835.726 ±  38468.054  ns/op
TextBufferBenchmark.parseBytes      GERMAN        10  avgt    4      128.692 ±      4.979  ns/op
TextBufferBenchmark.parseBytes      GERMAN      1000  avgt    4     4051.095 ±    117.887  ns/op
TextBufferBenchmark.parseBytes      GERMAN   1000000  avgt    4  5377560.356 ± 159490.559  ns/op

As you can see, SB is basically the same for short strings and significantly slower for non-ascii strings (this could be mitigated but probably not eliminated entirely). However, SB is significantly faster (~30%) for long ascii strings.

The code is basically the same for both, with StringBuilder instead of char[] of course. There is one small change: I've replaced the ascii copying loops, which would previously store the individual bytes into the char[], with builder.append(new String(bytes, offset, length, StandardCharsets.ISO_8859_1));. This is responsible for the speedup; just replacing the array store with .append(char) is always slower, but with the bulk append it gets very fast. Using ISO_8859_1 signals that the text can be copied as-is to the StringBuilder. My guess is that with this change, hotspot realizes this is a simple byte array copy, and saves some translating back and forth between utf-16 and ascii.
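The bulk-append trick, side by side with the per-char loop it replaces (a sketch, not the actual patch):

```java
import java.nio.charset.StandardCharsets;

class BulkAsciiAppend {
    // Slow path: per-char appends, one coder check and store per character.
    static void appendPerChar(StringBuilder sb, byte[] bytes, int offset, int length) {
        for (int i = 0; i < length; i++) {
            sb.append((char) (bytes[offset + i] & 0xFF));
        }
    }

    // Fast path: decoding as ISO-8859-1 is a straight byte copy into a
    // compact String, and appending that String to a Latin-1 StringBuilder
    // is a bulk array copy rather than per-char translation.
    static void appendBulk(StringBuilder sb, byte[] bytes, int offset, int length) {
        sb.append(new String(bytes, offset, length, StandardCharsets.ISO_8859_1));
    }
}
```

Both produce identical output; only the copying strategy differs.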

One problem here is that the changes required are really annoying. What I did was basically copy TextBuffer for SB (SBTextBuffer) and copied the BufferRecycler infrastructure for SB. I then created a base class for TextBuffer and SBTextBuffer (TextBufferBase) with abstract methods for what ParserBase uses. I changed ParserBase to use the new TextBufferBase and added a getter (_textBuffer()). Subclasses that use the original TextBuffer override _textBuffer() with a cast to TextBuffer. The new SBUTF8StreamJsonParser, basically a copy of UTF8StreamJsonParser, overrides it with a cast to SBTextBuffer. All this casting is (a) incompatible, (b) ugly, and (c) probably has some small impact on performance (it shows up in perfasm at least). So to implement this properly, we'd probably need a whole new SBUTF8StreamJsonParser that is independent of ParserBase. And presumably a non-blocking version too. That's lots of duplication.

Another technical challenge I've not looked at is that StringBuilders "degenerate" to utf-16 if they ever contain non-latin1 characters. That means that if any input contains non-ascii strings, the buffers in BufferRecycler become utf-16 (even after reset) and the performance benefit will disappear. This could be fixed by a simple jdk change (setLength(0) could reset the coder of StringBuilder to ASCII), but that'd take a while to appear in the releases. Until then, we would need to have separate recyclers for ASCII and non-ASCII StringBuilders, and support the transformation between them in SBTextBuffer... Fortunately this issue is not relevant in the benchmarks (strings on the same fork are homogeneous), but it would need to be resolved.
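One way the separate-recycler workaround could look (purely a sketch with hypothetical names): a builder that ever held non-Latin-1 text keeps its UTF-16 coder even after setLength(0), so recycle the two kinds in separate pools and prefer handing out still-compact builders.

```java
import java.util.ArrayDeque;

// Hypothetical workaround sketch: pool "degenerated" (UTF-16) builders
// separately from still-compact ones, preferring the compact pool.
class SplitBuilderRecycler {
    private final ArrayDeque<StringBuilder> compact = new ArrayDeque<>();
    private final ArrayDeque<StringBuilder> wide = new ArrayDeque<>();

    StringBuilder acquire() {
        StringBuilder sb = compact.poll();
        if (sb == null) sb = wide.poll();
        return sb != null ? sb : new StringBuilder();
    }

    // The caller reports whether the builder held any non-Latin-1 chars,
    // since the coder itself is not observable through the public API.
    void release(StringBuilder sb, boolean sawNonLatin1) {
        sb.setLength(0);
        (sawNonLatin1 ? wide : compact).offer(sb);
    }
}
```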

In summary, the preliminary performance numbers look really good for the "long ascii string" case. I think this is a common case (e.g. long base64 strings), so imo if this change was simple, it would be worth it. However given the technical challenges involved, I don't think it's feasible.

@cowtowncoder
Member

@yawkat Very impressive investigation, thank you for doing this!

It is pretty interesting that in the expected good case (long ascii) there are tangible performance benefits.
But I agree that, for the reasons outlined, it is unfortunately probably not practical to pursue at this point in time.

Still, it is good to be aware of this idea, maybe something could come out of it in future.

@cowtowncoder cowtowncoder changed the title Compact strings for TextBuffer Investigate use of Compact strings for TextBuffer Feb 7, 2023
@yawkat
Member Author

yawkat commented Feb 7, 2023

Another option I just thought of would be to use a byte[] containing raw UTF-8, instead of StringBuilder. The Strings would then be constructed directly from the byte array: as UTF-8 if it contains non-ascii characters, or as ISO_8859_1 if it contains only ascii (we can know this since we need to iterate anyway to detect quotes and escape sequences). This would have the same copying benefits as this test case, but would avoid the "degeneration" issue with StringBuilder.

Unfortunately it still has the same issues with TextBuffer compatibility. However, I think it may eventually be workable to replace all uses of TextBuffer with such a byte[]. This is because the String(char[]) constructors that we currently use will usually do a "compression" to a latin1 byte[] anyway; if we instead use a byte[] on the buffer side, even for the Reader-based JsonParsers, we might not take a performance hit, because we simply move the compression-to-latin1 step to our side instead of having the String constructor do it.
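Sketched, the decode step would look something like this (hypothetical helper; the ascii flag would come for free from the quote/escape scan):

```java
import java.nio.charset.StandardCharsets;

class Utf8BufferDecode {
    // Turn a raw UTF-8 byte buffer into a String. If the scan already proved
    // the content is pure ASCII, ISO-8859-1 decoding is a single copy into a
    // compact String; otherwise fall back to a real UTF-8 decode.
    static String decode(byte[] buf, int offset, int length, boolean asciiOnly) {
        return new String(buf, offset, length,
                asciiOnly ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_8);
    }
}
```

For pure ASCII the two charsets agree byte-for-byte, so the ISO-8859-1 fast path is safe whenever the flag is accurate.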
