[DRAFT]Enhance Performance of AsciiString Methods #13534

jchrys · 2023-08-07T04:25:01Z

Motivation:
AsciiString is a specialized class designed for ASCII character handling. However, it currently lacks specialized algorithms, which hinders its performance. but by implementing SWAR and other techniques, we can significantly improve their performance.

Modification:
Implemented branchless bulk case conversion for 8 characters at a time.
Optimized checks for the existence of lowercase or uppercase characters.
Utilized SWAR to efficiently find a given byte.

Result:
Improved performance.
Resolves #13522

jchrys · 2023-08-07T04:26:35Z

This PR is in a very early phase, and I've just implemented the algorithm.
I opened this Draft PR to share my progress and get feedback from the community at this stage.
Any input, suggestions, or reviews are highly appreciated.
I'm working on improving and refining the changes based on your feedback.
Thank you for taking the time to review it! ❤️

franz1981 · 2023-08-07T06:30:43Z

Well done @jchrys just few hints:

verify that applying a pattern for 4 bytes is worthy to match the typical use cases for small strings
often AsciiString have String cached within which make them a good candidate to use String::indexOf which is a JVM intrinsic - please verify if is worthy to use it instead of the AsciiString SWAR one

I will check the bit trick part with a paper with calm (the uppercase/lower case ones)

franz1981 · 2023-08-07T09:54:28Z

Last but not least, the math seem sounds (but I really nee to double check again), but I have no idea if there s any reference about it (especially the ranged ones) - do you have any link to share?

richardstartin · 2023-08-07T11:40:50Z

I feel like this could be simplified a lot by recognising that upper and lower case ASCII characters differ only by 0x20 (lower case characters intersect with 0x20, upper case letters do not). So to do a case insensitive search you compile a pattern for the character you're looking for (e.g. 'a' * 0x101010101010101L) as is currently done, and then need to map the input into lower case too (input | 0x2020202020202020L) which makes the subsequent index-of check case insensitive. So adding an overload which takes a mask to preprocess input (i.e. 0x2020202020202020L if you're passing in lower case patterns) and then calls the existing firstAnyPattern method would simplify things a lot.

richardstartin · 2023-08-07T11:44:31Z

Oh, and it goes without saying that boolean containsUpperCase(long word) can be implemented as return (word & 0x2020202020202020L) != 0x2020202020202020L, boolean containsLowerCase(long word) as (word & 0x2020202020202020L) != 0

franz1981 · 2023-08-07T12:07:52Z

Thanks @richardstartin !
I am adding this reference too https://github.com/facebook/folly/blob/main/folly/String.cpp#L523-L613 in case we can find anything interesting

jchrys · 2023-08-07T12:14:20Z

@franz1981
Thank you for your guidance! 😍
I will attempt to verify the performance of 4-byte SWAR and add a benchmark to compare it against String.indexOf.
As for the ranged approach, I utilized the range technique from the book "Hackers Delight" (Chapter 6. Searching word).

franz1981 · 2023-08-08T06:16:45Z

Hi @jchrys please check the suggestions from @richardstartin and the link I have shared; it's likely that we don't need that complexity for the lowercase/uppercase check and translation.
Re 4 bytes SWAR i mean "in addition" to the existing one
eg if 13 bytes long string is used, a 1 round of 8 bytes loop, 1 round of 4 bytes and 1 single byte unrolled

jchrys · 2023-08-08T12:18:55Z

@franz1981
Absolutely, I'll keep you posted! 😊

jchrys · 2023-08-13T16:36:36Z

Hello, @franz1981. I have the results to share.

The suggested toLowerCase and caseInsensitiveSearch methods are not applicable in our case, as we handle characters beyond the alphabet. For instance, using suggested toLowerCase converts [ to {.
I've found that the CaseConversion method from folly performs faster than my current implementation.
I plan to adopt this approach.
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8 2023-07-18, Ubuntu 22.04.3 LTS. (benchmark branch)

Benchmark                                                 (size)   Mode  Cnt    Score   Error      Units
AsciiStringUtilCaseConversionBenchmark.toLowerCase             7  thrpt   16  104.415 ± 8.454     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCase            16  thrpt   16   75.836 ± 0.243     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCase            23  thrpt   16   45.810 ± 0.490     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCase            32  thrpt   16   51.081 ± 0.073     ops/us

AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly        7  thrpt   16  107.788 ± 5.694     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly       16  thrpt   16   84.052 ± 0.136     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly       23  thrpt   16   48.285 ± 0.237     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly       32  thrpt   16   60.809 ± 0.067     ops/us

AsciiString.indexOfvsString.indexOf
I attempted to surpass the performance of String.indexOf with AsciiString.indexOf, but I was unsuccessful when the size was equal to or greater than 16. (tested both cached and non-cached AsciiString)
It seems that openJDK's String.indexOf leverages AVX2.

Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8 2023-07-18, Ubuntu 22.04.3 LTS.(benchmark branch)

Benchmark                                        (logPermutations)  (size)   Mode  Cnt    Score   Error      Units
AsciiStringIndexOfBenchmark.indexOf                             11       7  thrpt   16  195.032 ± 0.235     ops/us
AsciiStringIndexOfBenchmark.indexOf                             11      16  thrpt   16  117.885 ± 0.069     ops/us
AsciiStringIndexOfBenchmark.indexOf                             11      23  thrpt   16   85.556 ± 0.575     ops/us
AsciiStringIndexOfBenchmark.indexOf                             11      32  thrpt   16   78.403 ± 0.061     ops/us

AsciiStringIndexOfBenchmark.stringIndexOf                       11       7  thrpt   16   71.387 ± 6.450     ops/us
AsciiStringIndexOfBenchmark.stringIndexOf                       11      16  thrpt   16  154.637 ± 0.466     ops/us
AsciiStringIndexOfBenchmark.stringIndexOf                       11      23  thrpt   16   69.403 ± 0.592     ops/us
AsciiStringIndexOfBenchmark.stringIndexOf                       11      32  thrpt   16  150.909 ± 0.483     ops/us

4Byte SWAR
Applying 4Byte SWAR optimization to AsciiString.indexOf did not yield improved performance compared to a linear unrolled approach. Surprisingly, the 4Byte SWAR optimization was actually slower in this context.
However, the 4Byte SWAR optimization demonstrated better performance in the AsciiStringUtil.containsLowerCase function.

franz1981 · 2023-08-15T10:45:36Z

The folly impl is expected to be faster, so, happy that numbers reflect it. I think the approach should be used for both (upper/lower case conversions and case insensitive comparisons).
The String:;indexOf analysis should add some print assembly and perfasm analysis in, to validate that the Intrinsic doesn't kick in with <16 long strings.
The idea behind checking this, is to find a cutoff value to decide when use it Vs using the SWAR version, assuming the cached String to be available.
Thanks to JDK compact strings, when AsciiString got a cache String in, is very likely it is a Latin one, and indexOf to be beneficial, if vectorized.

jchrys · 2023-08-15T15:55:00Z

Thank you for the detailed explanation! I will delve into this and analyze it. 😄

jchrys · 2023-08-20T10:21:14Z

Hello, @franz1981.
I discovered that the intrinsic implementation of String.indexOf(latin) leverages AVX and AVX2 in intel x86-64 machines.
it appears that the function performs comparison using 32-byte and 16-byte blocks. and incase where the remaining bytes are less than 16, it employs single-byte comparison against the byteToFind value.

The following assembly code has been extracted from perfasm output. the accompanying comments might not be accurate. (I added them for my own reference). (x86_64, 1x10x2, Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8, Ubuntu 22.04.3 LTS)

0.12%          0x00007f748109d8f0:   lea    0x10(%rsi,%rcx,1),%rdi
   0.46%       0x00007f748109d8f5:   mov    %rdi,%rbx
   1.43%       0x00007f748109d8f8:   cmp    $0x10,%edx                  ; compare length, 16 (%edx = length)
               0x00007f748109d8fb:   jl     0x00007f748109d97a          ; jump if length is less than 16
   2.10%       0x00007f748109d901:   cmp    $0x20,%edx                  ; compare length, 32
          ╭    0x00007f748109d904:   jl     0x00007f748109d94           ; jump if length is less than 32
   0.98%  │    0x00007f748109d90a:   vmovd  %eax,%xmm0                  ; move eax(charToFind) to %xmm0(128-bit register).
   0.55%  │    0x00007f748109d90e:   vpbroadcastb %xmm0,%ymm0           ; broadcast xmm0 to ymm0(256-bit register)
   1.41%  │    0x00007f748109d913:   vpxor  %ymm1,%ymm1,%ymm1           ; zeroout ymm1
   0.85%  │    0x00007f748109d917:   mov    %edx,%ecx                   ; copy edx(length) to ecx
   0.94%  │    0x00007f748109d919:   and    $0xffffffe0,%ecx            ; get 256bit count (ecx = length)
   0.45%  │    0x00007f748109d91c:   and    $0x1f,%edx                  ; get less than 256bit count (ecx = length)
   1.60%  │↗   0x00007f748109d91f:   vmovdqu (%rbx),%ymm2               ; load 32byte from underlying array to ymm2
   4.20%  ││   0x00007f748109d923:   vpcmpeqb %ymm0,%ymm2,%ymm2         ; compare each byte in ymm0 and ymm2 and store result in ymm2
  11.46%  ││   0x00007f748109d927:   vptest %ymm2,%ymm1                 ; perform test 
   2.04%  ││   0x00007f748109d92c:   jae    0x00007f748109d99b          ; jump if found 
   3.48%  ││   0x00007f748109d932:   add    $0x20,%rbx                  ; advance rbx by 32
   2.93%  ││   0x00007f748109d936:   sub    $0x20,%ecx                  ; substract ecx(256bit count) by 32
          │╰   0x00007f748109d939:   jne    0x00007f748109d91f          ; jump if ecx is not equal to zero
   0.05%  │ ╭  0x00007f748109d93b:   jmp    0x00007f748109d94d          ; unconditional jump
          ↘ │  0x00007f748109d940:   vmovd  %eax,%xmm0                  ; move eax(byteToFind) to xmm0(128bit)
            │  0x00007f748109d944:   vpxor  %xmm1,%xmm1,%xmm1           ; zero-out xmm1
            │  0x00007f748109d948:   vpshufb %xmm1,%xmm0,%xmm0          ; 
   0.07%    ↘  0x00007f748109d94d:   cmp    $0x10,%edx                  ; compare edx,16
               0x00007f748109d950:   jl     0x00007f748109d97a          ; jump if less than 16
               0x00007f748109d956:   mov    %edx,%ecx                   ; copy edx(length) to ecx
               0x00007f748109d958:   and    $0xfffffff0,%ecx            ; 16byte count
               0x00007f748109d95b:   and    $0xf,%edx                   ; less than less than 16byte count
               0x00007f748109d95e:   vmovdqu (%rbx),%xmm2               ; load 16bytes from rbx to xmm2
               0x00007f748109d962:   vpcmpeqb %xmm0,%xmm2,%xmm2         ; compare xmm0 and xmm2 and save result to xmm2
               0x00007f748109d966:   vptest %xmm2,%xmm1                 ; test xmm2 and xmm1
               0x00007f748109d96b:   jae    0x00007f748109d99b          ; jump if test result is found
               0x00007f748109d971:   add    $0x10,%rbx                  ; advance rbx by 16
               0x00007f748109d975:   sub    $0x10,%ecx                  ; subtract remaining by 16
               0x00007f748109d978:   jne    0x00007f748109d95e          ; jump if remaining is not 0
               0x00007f748109d97a:   test   %edx,%edx                   ; test if length is 0
               0x00007f748109d97c:   je     0x00007f748109d994          ; jump, if length is 0
               0x00007f748109d982:   movzbl (%rbx),%ecx                 ; move rbx to ecx register(array idx)
               0x00007f748109d985:   cmp    %ecx,%eax                   ; compare byteToFind(eax) for and cur targeting byte(ecx)
               0x00007f748109d987:   je     0x00007f748109d9a5          ; jump, if found
               0x00007f748109d989:   add    $0x1,%rbx                   ; advance arrayIdx
               0x00007f748109d98d:   sub    $0x1,%edx                   ; decrement length
               0x00007f748109d990:   je     0x00007f748109d994          ; jump, if (edx = length) is zero
               0x00007f748109d992:   jmp    0x00007f748109d982          ; unconditional jump
               0x00007f748109d994:   mov    $0xffffffff,%ebx            ; not found return -1
               0x00007f748109d999:   jmp    0x00007f748109d9a8          ; 
               0x00007f748109d99b:   vpmovmskb %ymm2,%ecx               ; dealing with vector result
               0x00007f748109d99f:   bsf    %ecx,%eax                   ;

The chart below presents a comparison between AsciiString.indexOf()(Green) and AsciiString.toString().indexOf()(Blue).

(Please let me know if you require additional information)

I believe that 16 could be a suitable candidate for the cutoff value. What are your thoughts on this?

franz1981 · 2023-08-20T11:29:39Z

Excellent analysis @jchrys well done!!
Soooo...just a last question and we are ready to go:
The toString.indexOf case isn't creating the String right? It's leveraging a cache one I suppose....

Related the cut-off: the idea is to check if the cached String exist only if the size is less than 16 than use the right method.
I suppose that the unrolled trick I have made it's a bit making results unfair, 'cause the stride by which the JIT unroll its 16 char long loop still require more comparisons, branches and, in general, instructions. You can verify it by running a single experiment adding perfnorm numbers.
See https://blogs.oracle.com/javamagazine/post/loop-unrolling for some reference.

jchrys · 2023-08-20T13:07:41Z

Hello, @franz1981 . Thanks a lot! 😃

Soooo...just a last question and we are ready to go:
The toString.indexOf case isn't creating the String right? It's leveraging a cache one I suppose....

Yes, that's correct. It uses a cache and doesn't create a new String. I cached it during the initialization phase. (Link)

    public static Object blackhole;

    @Setup(Level.Trial)
    @SuppressJava6Requirement(reason = "using SplittableRandom to reliably produce data")
    public void init() {
       ---snip ---
        for (int i = 0; i < permutations; ++i) {
            ---snip---
            data[i] = new AsciiString(byteArray);
            blackhole = data[i].toString(); // cache
        }
    }

I will look into the topic of Related the cut-off

plokhotnyuk · 2023-08-20T13:33:22Z

@jchrys Could you please check how performance scale with 2, 4, 8, 16 threads using '@thread' annotation of JMH?

jchrys · 2023-08-20T13:42:39Z

@plokhotnyuk Sure 😄 I will notify you once I have the results. (It might take some time)

franz1981 · 2023-08-20T15:08:10Z

@plokhotnyuk hi! Why with multiple threads? Vectorization isn't involved on Netty side (although we compare against)...are you worried of some instability in the system?

@jchrys remember to use numactl --localalloc -N 0 or whatever node, tuned latency profile, disable turbo boost, while running it.
If you cannot do any of this, just UseNumanis better than nothing...

jchrys · 2023-08-20T15:34:36Z

Hi, @franz1981.
Thanks! I will review and conform those configurations.

Regarding the tuned latency profile, would the network-latency(Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance) profile be suitable for the purpose?

franz1981 · 2023-08-20T17:29:40Z

Yep @jchrys it should be ok!
Last 2 things, but not least; the benchmark where is placing the found element? I think is not fair between String/AsciiString...because the former will get more comparisons if it hit the 16-sized tail loop while AsciiString (always) not.

Very last point is: in the best case for SIMD version eg 64 the performance is not (at all) near to the expected speedup if compared to SWAR too eg 4 times.
This could be due to cache bandwidth limitation or...some other quirks: to know that it would be great to report how cache misses changes while varying the input size.

These 2 last points or not related the PR but just for scientific curiosity: I would expect the JDK to outperform much more the purely Java version by more than just 2X, best case...unless the bottleneck is elsewhere (reducing the number of inputs/dataset to the bare minimum should help there as well, given that are both branchless and we can always use 64 as target size for both).

plokhotnyuk · 2023-08-21T10:27:25Z

@franz1981 @jchrys Usually Netty-based services use several threads, so it is quite natural to run benchmarks using multiple threads.

Here is an example of comparison between 1 and 16 threads, so you can easily see that using vectorization, allocations or concurrent data structures reduce scalability over cores.

franz1981 · 2023-08-21T10:45:11Z

Sorry @plokhotnyuk let me gently disagree here: the usual context of execution of Netty here doesn't apply to very fine grain (micro, really) benchmarks like these.

Unless there are concerns related the scalability of vectorized instructions, I don't see a point to use multiple threads for a type of benchmark which doesn't involve the full Netty pipeline stack nor any form of sharing/concurrent data structures (not allocation intensive operations). Running the same benchmark with -prof gc will help to show the last point related allocations.

jchrys · 2023-08-25T16:08:18Z

Hello, @franz1981

I suppose that the unrolled trick I have made it's a bit making results unfair, 'cause the stride by which the JIT unroll its 16 char long loop still require more comparisons, branches and, in general, instructions. You can verify it by running a single experiment adding perfnorm numbers.
See https://blogs.oracle.com/javamagazine/post/loop-unrolling for some reference.

Could you please explain a bit more about the paragraph above? I'm having trouble understanding why using unrolledFirstIndexOf could cause things to be unfair.

normanmaurer · 2024-02-20T17:33:01Z

What is the state of this one ?

jchrys · 2024-03-13T12:29:58Z

Let me work on this issue. I will revisit it.

Motivation: Currently, `AsciiString#indexOf` method using naive iteration algorithm. Modification: Utilize SWAR technique. Result: Faster `AsciiString#indexOf`.

franz1981 mentioned this pull request Nov 22, 2023

Draft: Improve Unsafe::setMemory for small writes openjdk/jdk#16760

Closed

3 tasks

jchrys added 4 commits March 17, 2024 21:40

Apply SWAR to AsciiString

1edcd90

Motivation: Currently, `AsciiString#indexOf` method using naive iteration algorithm. Modification: Utilize SWAR technique. Result: Faster `AsciiString#indexOf`.

Apply SWAR to case conversion

1d48e05

Apply SWAR to case conversion

88f3e9d

Apply EqualsIgnoreCase

9fd80db

jchrys added 20 commits March 17, 2024 22:10

remove unused

4760766

Resolve Checkstyle error

a1b6ad9

avoid boundcheck

2c651f8

Remove unused line

7cf37e4

ADD stringIndexOf benchmark

d82026f

ADD toLowercase, containsUpperCase benchmark

638cee0

Apply case conversion algorithm from folly

25d1bf7

Resolve Checkstyle complaint

79cd5fd

Improve AsciiStringUtilCaseConversionBenchmark

7f97110

notUnaligned case for equalsIgnoreCases

64f72ae

Apply final

01be7cb

Enhance JavaDoc

4e61471

Format AsciiStringIndexOfBenchmark

74dcd88

Format AsciiStringUtilCaseConversionBenchmark

7678ecd

Separate SWARUtil

baa5dfb

align terms.

ec5a210

add copyright header

0c4f622

make AsciiStringUtil package private

71d486a

Speed up ByteBufUtil#firstIndexOf

f9c6266

fix naming

a9d041f

jchrys force-pushed the 4.1-swar-ascii branch from 4f6eb3b to a9d041f Compare March 17, 2024 15:25

jchrys mentioned this pull request Mar 20, 2024

Enhance AsciiString#toLowerCase and AsciiString#toUpperCase #13913

Merged

franz1981 mentioned this pull request Mar 23, 2024

8320448: Accelerate IndexOf using AVX2 openjdk/jdk#16753

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DRAFT]Enhance Performance of AsciiString Methods #13534

[DRAFT]Enhance Performance of AsciiString Methods #13534

jchrys commented Aug 7, 2023 •

edited

jchrys commented Aug 7, 2023

franz1981 commented Aug 7, 2023

franz1981 commented Aug 7, 2023

richardstartin commented Aug 7, 2023

richardstartin commented Aug 7, 2023 •

edited

franz1981 commented Aug 7, 2023

jchrys commented Aug 7, 2023 •

edited

franz1981 commented Aug 8, 2023 •

edited

jchrys commented Aug 8, 2023 •

edited

jchrys commented Aug 13, 2023 •

edited

franz1981 commented Aug 15, 2023 •

edited

jchrys commented Aug 15, 2023

jchrys commented Aug 20, 2023 •

edited

franz1981 commented Aug 20, 2023

jchrys commented Aug 20, 2023 •

edited

plokhotnyuk commented Aug 20, 2023

jchrys commented Aug 20, 2023

franz1981 commented Aug 20, 2023 •

edited

jchrys commented Aug 20, 2023 •

edited

franz1981 commented Aug 20, 2023

plokhotnyuk commented Aug 21, 2023

franz1981 commented Aug 21, 2023 •

edited

jchrys commented Aug 25, 2023 •

edited

normanmaurer commented Feb 20, 2024

jchrys commented Mar 13, 2024 •

edited

[DRAFT]Enhance Performance of AsciiString Methods #13534

Are you sure you want to change the base?

[DRAFT]Enhance Performance of AsciiString Methods #13534

Conversation

jchrys commented Aug 7, 2023 • edited

jchrys commented Aug 7, 2023

franz1981 commented Aug 7, 2023

franz1981 commented Aug 7, 2023

richardstartin commented Aug 7, 2023

richardstartin commented Aug 7, 2023 • edited

franz1981 commented Aug 7, 2023

jchrys commented Aug 7, 2023 • edited

franz1981 commented Aug 8, 2023 • edited

jchrys commented Aug 8, 2023 • edited

jchrys commented Aug 13, 2023 • edited

franz1981 commented Aug 15, 2023 • edited

jchrys commented Aug 15, 2023

jchrys commented Aug 20, 2023 • edited

franz1981 commented Aug 20, 2023

jchrys commented Aug 20, 2023 • edited

plokhotnyuk commented Aug 20, 2023

jchrys commented Aug 20, 2023

franz1981 commented Aug 20, 2023 • edited

jchrys commented Aug 20, 2023 • edited

franz1981 commented Aug 20, 2023

plokhotnyuk commented Aug 21, 2023

franz1981 commented Aug 21, 2023 • edited

jchrys commented Aug 25, 2023 • edited

normanmaurer commented Feb 20, 2024

jchrys commented Mar 13, 2024 • edited

jchrys commented Aug 7, 2023 •

edited

richardstartin commented Aug 7, 2023 •

edited

jchrys commented Aug 7, 2023 •

edited

franz1981 commented Aug 8, 2023 •

edited

jchrys commented Aug 8, 2023 •

edited

jchrys commented Aug 13, 2023 •

edited

franz1981 commented Aug 15, 2023 •

edited

jchrys commented Aug 20, 2023 •

edited

jchrys commented Aug 20, 2023 •

edited

franz1981 commented Aug 20, 2023 •

edited

jchrys commented Aug 20, 2023 •

edited

franz1981 commented Aug 21, 2023 •

edited

jchrys commented Aug 25, 2023 •

edited

jchrys commented Mar 13, 2024 •

edited