Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT]Enhance Performance of AsciiString Methods #13534

Draft
wants to merge 24 commits into
base: 4.1
Choose a base branch
from

Conversation

jchrys
Copy link
Contributor

@jchrys jchrys commented Aug 7, 2023

Motivation:
AsciiString is a specialized class designed for ASCII character handling. However, it currently lacks specialized algorithms, which hinders its performance. but by implementing SWAR and other techniques, we can significantly improve their performance.

Modification:
Implemented branchless bulk case conversion for 8 characters at a time.
Optimized checks for the existence of lowercase or uppercase characters.
Utilized SWAR to efficiently find a given byte.

Result:
Improved performance.
Resolves #13522

@jchrys
Copy link
Contributor Author

jchrys commented Aug 7, 2023

This PR is in a very early phase, and I've just implemented the algorithm.
I opened this Draft PR to share my progress and get feedback from the community at this stage.
Any input, suggestions, or reviews are highly appreciated.
I'm working on improving and refining the changes based on your feedback.
Thank you for taking the time to review it! ❤️

@franz1981
Copy link
Contributor

Well done @jchrys just few hints:

  • verify that applying a pattern for 4 bytes is worthy to match the typical use cases for small strings
  • often AsciiString have String cached within which make them a good candidate to use String::indexOf which is a JVM intrinsic - please verify if is worthy to use it instead of the AsciiString SWAR one

I will check the bit trick part with a paper with calm (the uppercase/lower case ones)

@franz1981
Copy link
Contributor

Last but not least, the math seem sounds (but I really nee to double check again), but I have no idea if there s any reference about it (especially the ranged ones) - do you have any link to share?

@richardstartin
Copy link

I feel like this could be simplified a lot by recognising that upper and lower case ASCII characters differ only by 0x20 (lower case characters intersect with 0x20, upper case letters do not). So to do a case insensitive search you compile a pattern for the character you're looking for (e.g. 'a' * 0x101010101010101L) as is currently done, and then need to map the input into lower case too (input | 0x2020202020202020L) which makes the subsequent index-of check case insensitive. So adding an overload which takes a mask to preprocess input (i.e. 0x2020202020202020L if you're passing in lower case patterns) and then calls the existing firstAnyPattern method would simplify things a lot.

@richardstartin
Copy link

richardstartin commented Aug 7, 2023

Oh, and it goes without saying that boolean containsUpperCase(long word) can be implemented as return (word & 0x2020202020202020L) != 0x2020202020202020L, boolean containsLowerCase(long word) as (word & 0x2020202020202020L) != 0

@franz1981
Copy link
Contributor

Thanks @richardstartin !
I am adding this reference too https://github.com/facebook/folly/blob/main/folly/String.cpp#L523-L613 in case we can find anything interesting

@jchrys
Copy link
Contributor Author

jchrys commented Aug 7, 2023

@franz1981
Thank you for your guidance! 😍
I will attempt to verify the performance of 4-byte SWAR and add a benchmark to compare it against String.indexOf.
As for the ranged approach, I utilized the range technique from the book "Hackers Delight" (Chapter 6. Searching word).

@franz1981
Copy link
Contributor

franz1981 commented Aug 8, 2023

Hi @jchrys please check the suggestions from @richardstartin and the link I have shared; it's likely that we don't need that complexity for the lowercase/uppercase check and translation.
Re 4 bytes SWAR i mean "in addition" to the existing one
eg if 13 bytes long string is used, a 1 round of 8 bytes loop, 1 round of 4 bytes and 1 single byte unrolled

@jchrys
Copy link
Contributor Author

jchrys commented Aug 8, 2023

@franz1981
Absolutely, I'll keep you posted! 😊

@jchrys
Copy link
Contributor Author

jchrys commented Aug 13, 2023

Hello, @franz1981. I have the results to share.

  1. The suggested toLowerCase and caseInsensitiveSearch methods are not applicable in our case, as we handle characters beyond the alphabet. For instance, using suggested toLowerCase converts [ to {.

  2. I've found that the CaseConversion method from folly performs faster than my current implementation.
    I plan to adopt this approach.
    Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8 2023-07-18, Ubuntu 22.04.3 LTS. (benchmark branch)

Benchmark                                                 (size)   Mode  Cnt    Score   Error      Units
AsciiStringUtilCaseConversionBenchmark.toLowerCase             7  thrpt   16  104.415 ± 8.454     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCase            16  thrpt   16   75.836 ± 0.243     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCase            23  thrpt   16   45.810 ± 0.490     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCase            32  thrpt   16   51.081 ± 0.073     ops/us

AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly        7  thrpt   16  107.788 ± 5.694     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly       16  thrpt   16   84.052 ± 0.136     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly       23  thrpt   16   48.285 ± 0.237     ops/us
AsciiStringUtilCaseConversionBenchmark.toLowerCaseFolly       32  thrpt   16   60.809 ± 0.067     ops/us
  1. AsciiString.indexOfvsString.indexOf
    I attempted to surpass the performance of String.indexOf with AsciiString.indexOf, but I was unsuccessful when the size was equal to or greater than 16. (tested both cached and non-cached AsciiString)
    It seems that openJDK's String.indexOf leverages AVX2.

Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8 2023-07-18, Ubuntu 22.04.3 LTS.(benchmark branch)

Benchmark                                        (logPermutations)  (size)   Mode  Cnt    Score   Error      Units
AsciiStringIndexOfBenchmark.indexOf                             11       7  thrpt   16  195.032 ± 0.235     ops/us
AsciiStringIndexOfBenchmark.indexOf                             11      16  thrpt   16  117.885 ± 0.069     ops/us
AsciiStringIndexOfBenchmark.indexOf                             11      23  thrpt   16   85.556 ± 0.575     ops/us
AsciiStringIndexOfBenchmark.indexOf                             11      32  thrpt   16   78.403 ± 0.061     ops/us

AsciiStringIndexOfBenchmark.stringIndexOf                       11       7  thrpt   16   71.387 ± 6.450     ops/us
AsciiStringIndexOfBenchmark.stringIndexOf                       11      16  thrpt   16  154.637 ± 0.466     ops/us
AsciiStringIndexOfBenchmark.stringIndexOf                       11      23  thrpt   16   69.403 ± 0.592     ops/us
AsciiStringIndexOfBenchmark.stringIndexOf                       11      32  thrpt   16  150.909 ± 0.483     ops/us
  1. 4Byte SWAR
    Applying 4Byte SWAR optimization to AsciiString.indexOf did not yield improved performance compared to a linear unrolled approach. Surprisingly, the 4Byte SWAR optimization was actually slower in this context.
    However, the 4Byte SWAR optimization demonstrated better performance in the AsciiStringUtil.containsLowerCase function.

@franz1981
Copy link
Contributor

franz1981 commented Aug 15, 2023

The folly impl is expected to be faster, so, happy that numbers reflect it. I think the approach should be used for both (upper/lower case conversions and case insensitive comparisons).
The String:;indexOf analysis should add some print assembly and perfasm analysis in, to validate that the Intrinsic doesn't kick in with <16 long strings.
The idea behind checking this, is to find a cutoff value to decide when use it Vs using the SWAR version, assuming the cached String to be available.
Thanks to JDK compact strings, when AsciiString got a cache String in, is very likely it is a Latin one, and indexOf to be beneficial, if vectorized.

@jchrys
Copy link
Contributor Author

jchrys commented Aug 15, 2023

Thank you for the detailed explanation! I will delve into this and analyze it. 😄

@jchrys
Copy link
Contributor Author

jchrys commented Aug 20, 2023

Hello, @franz1981.
I discovered that the intrinsic implementation of String.indexOf(latin) leverages AVX and AVX2 in intel x86-64 machines.
it appears that the function performs comparison using 32-byte and 16-byte blocks. and incase where the remaining bytes are less than 16, it employs single-byte comparison against the byteToFind value.

The following assembly code has been extracted from perfasm output. the accompanying comments might not be accurate. (I added them for my own reference). (x86_64, 1x10x2, Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8, Ubuntu 22.04.3 LTS)

0.12%          0x00007f748109d8f0:   lea    0x10(%rsi,%rcx,1),%rdi
   0.46%       0x00007f748109d8f5:   mov    %rdi,%rbx
   1.43%       0x00007f748109d8f8:   cmp    $0x10,%edx                  ; compare length, 16 (%edx = length)
               0x00007f748109d8fb:   jl     0x00007f748109d97a          ; jump if length is less than 16
   2.10%       0x00007f748109d901:   cmp    $0x20,%edx                  ; compare length, 32
0x00007f748109d904:   jl     0x00007f748109d94           ; jump if length is less than 32
   0.98%  │    0x00007f748109d90a:   vmovd  %eax,%xmm0                  ; move eax(charToFind) to %xmm0(128-bit register).
   0.55%  │    0x00007f748109d90e:   vpbroadcastb %xmm0,%ymm0           ; broadcast xmm0 to ymm0(256-bit register)
   1.41%  │    0x00007f748109d913:   vpxor  %ymm1,%ymm1,%ymm1           ; zeroout ymm1
   0.85%  │    0x00007f748109d917:   mov    %edx,%ecx                   ; copy edx(length) to ecx
   0.94%  │    0x00007f748109d919:   and    $0xffffffe0,%ecx            ; get 256bit count (ecx = length)
   0.45%  │    0x00007f748109d91c:   and    $0x1f,%edx                  ; get less than 256bit count (ecx = length)
   1.60%  │↗   0x00007f748109d91f:   vmovdqu (%rbx),%ymm2               ; load 32byte from underlying array to ymm2
   4.20%  ││   0x00007f748109d923:   vpcmpeqb %ymm0,%ymm2,%ymm2         ; compare each byte in ymm0 and ymm2 and store result in ymm2
  11.46%  ││   0x00007f748109d927:   vptest %ymm2,%ymm1                 ; perform test 
   2.04%  ││   0x00007f748109d92c:   jae    0x00007f748109d99b          ; jump if found 
   3.48%  ││   0x00007f748109d932:   add    $0x20,%rbx                  ; advance rbx by 32
   2.93%  ││   0x00007f748109d936:   sub    $0x20,%ecx                  ; substract ecx(256bit count) by 32
          │╰   0x00007f748109d939:   jne    0x00007f748109d91f          ; jump if ecx is not equal to zero
   0.05%  │ ╭  0x00007f748109d93b:   jmp    0x00007f748109d94d          ; unconditional jump
          ↘ │  0x00007f748109d940:   vmovd  %eax,%xmm0                  ; move eax(byteToFind) to xmm0(128bit)
0x00007f748109d944:   vpxor  %xmm1,%xmm1,%xmm1           ; zero-out xmm1
0x00007f748109d948:   vpshufb %xmm1,%xmm0,%xmm0          ; 
   0.07%    ↘  0x00007f748109d94d:   cmp    $0x10,%edx                  ; compare edx,16
               0x00007f748109d950:   jl     0x00007f748109d97a          ; jump if less than 16
               0x00007f748109d956:   mov    %edx,%ecx                   ; copy edx(length) to ecx
               0x00007f748109d958:   and    $0xfffffff0,%ecx            ; 16byte count
               0x00007f748109d95b:   and    $0xf,%edx                   ; less than less than 16byte count
               0x00007f748109d95e:   vmovdqu (%rbx),%xmm2               ; load 16bytes from rbx to xmm2
               0x00007f748109d962:   vpcmpeqb %xmm0,%xmm2,%xmm2         ; compare xmm0 and xmm2 and save result to xmm2
               0x00007f748109d966:   vptest %xmm2,%xmm1                 ; test xmm2 and xmm1
               0x00007f748109d96b:   jae    0x00007f748109d99b          ; jump if test result is found
               0x00007f748109d971:   add    $0x10,%rbx                  ; advance rbx by 16
               0x00007f748109d975:   sub    $0x10,%ecx                  ; subtract remaining by 16
               0x00007f748109d978:   jne    0x00007f748109d95e          ; jump if remaining is not 0
               0x00007f748109d97a:   test   %edx,%edx                   ; test if length is 0
               0x00007f748109d97c:   je     0x00007f748109d994          ; jump, if length is 0
               0x00007f748109d982:   movzbl (%rbx),%ecx                 ; move rbx to ecx register(array idx)
               0x00007f748109d985:   cmp    %ecx,%eax                   ; compare byteToFind(eax) for and cur targeting byte(ecx)
               0x00007f748109d987:   je     0x00007f748109d9a5          ; jump, if found
               0x00007f748109d989:   add    $0x1,%rbx                   ; advance arrayIdx
               0x00007f748109d98d:   sub    $0x1,%edx                   ; decrement length
               0x00007f748109d990:   je     0x00007f748109d994          ; jump, if (edx = length) is zero
               0x00007f748109d992:   jmp    0x00007f748109d982          ; unconditional jump
               0x00007f748109d994:   mov    $0xffffffff,%ebx            ; not found return -1
               0x00007f748109d999:   jmp    0x00007f748109d9a8          ; 
               0x00007f748109d99b:   vpmovmskb %ymm2,%ecx               ; dealing with vector result
               0x00007f748109d99f:   bsf    %ecx,%eax                   ;

The chart below presents a comparison between AsciiString.indexOf()(Green) and AsciiString.toString().indexOf()(Blue).

Screenshot 2023-08-20 at 5 15 45 PM (Please let me know if you require additional information)

I believe that 16 could be a suitable candidate for the cutoff value. What are your thoughts on this?

@franz1981
Copy link
Contributor

Excellent analysis @jchrys well done!!
Soooo...just a last question and we are ready to go:
The toString.indexOf case isn't creating the String right? It's leveraging a cache one I suppose....

Related the cut-off: the idea is to check if the cached String exist only if the size is less than 16 than use the right method.
I suppose that the unrolled trick I have made it's a bit making results unfair, 'cause the stride by which the JIT unroll its 16 char long loop still require more comparisons, branches and, in general, instructions. You can verify it by running a single experiment adding perfnorm numbers.
See https://blogs.oracle.com/javamagazine/post/loop-unrolling for some reference.

@jchrys
Copy link
Contributor Author

jchrys commented Aug 20, 2023

Hello, @franz1981 . Thanks a lot! 😃

Soooo...just a last question and we are ready to go:
The toString.indexOf case isn't creating the String right? It's leveraging a cache one I suppose....

Yes, that's correct. It uses a cache and doesn't create a new String. I cached it during the initialization phase. (Link)

    public static Object blackhole;

    @Setup(Level.Trial)
    @SuppressJava6Requirement(reason = "using SplittableRandom to reliably produce data")
    public void init() {
       ---snip ---
        for (int i = 0; i < permutations; ++i) {
            ---snip---
            data[i] = new AsciiString(byteArray);
            blackhole = data[i].toString(); // cache
        }
    }

I will look into the topic of Related the cut-off

@plokhotnyuk
Copy link

@jchrys Could you please check how performance scale with 2, 4, 8, 16 threads using '@thread' annotation of JMH?

@jchrys
Copy link
Contributor Author

jchrys commented Aug 20, 2023

@plokhotnyuk Sure 😄 I will notify you once I have the results. (It might take some time)

@franz1981
Copy link
Contributor

franz1981 commented Aug 20, 2023

@plokhotnyuk hi! Why with multiple threads? Vectorization isn't involved on Netty side (although we compare against)...are you worried of some instability in the system?

@jchrys remember to use numactl --localalloc -N 0 or whatever node, tuned latency profile, disable turbo boost, while running it.
If you cannot do any of this, just UseNumanis better than nothing...

@jchrys
Copy link
Contributor Author

jchrys commented Aug 20, 2023

Hi, @franz1981.
Thanks! I will review and conform those configurations.

Regarding the tuned latency profile, would the network-latency(Optimize for deterministic performance at the cost of increased power consumption, focused on low latency network performance) profile be suitable for the purpose?

@franz1981
Copy link
Contributor

Yep @jchrys it should be ok!
Last 2 things, but not least; the benchmark where is placing the found element? I think is not fair between String/AsciiString...because the former will get more comparisons if it hit the 16-sized tail loop while AsciiString (always) not.

Very last point is: in the best case for SIMD version eg 64 the performance is not (at all) near to the expected speedup if compared to SWAR too eg 4 times.
This could be due to cache bandwidth limitation or...some other quirks: to know that it would be great to report how cache misses changes while varying the input size.

These 2 last points or not related the PR but just for scientific curiosity: I would expect the JDK to outperform much more the purely Java version by more than just 2X, best case...unless the bottleneck is elsewhere (reducing the number of inputs/dataset to the bare minimum should help there as well, given that are both branchless and we can always use 64 as target size for both).

@plokhotnyuk
Copy link

@franz1981 @jchrys Usually Netty-based services use several threads, so it is quite natural to run benchmarks using multiple threads.

Here is an example of comparison between 1 and 16 threads, so you can easily see that using vectorization, allocations or concurrent data structures reduce scalability over cores.

@franz1981
Copy link
Contributor

franz1981 commented Aug 21, 2023

Sorry @plokhotnyuk let me gently disagree here: the usual context of execution of Netty here doesn't apply to very fine grain (micro, really) benchmarks like these.

Unless there are concerns related the scalability of vectorized instructions, I don't see a point to use multiple threads for a type of benchmark which doesn't involve the full Netty pipeline stack nor any form of sharing/concurrent data structures (not allocation intensive operations). Running the same benchmark with -prof gc will help to show the last point related allocations.

@jchrys
Copy link
Contributor Author

jchrys commented Aug 25, 2023

Hello, @franz1981

I suppose that the unrolled trick I have made it's a bit making results unfair, 'cause the stride by which the JIT unroll its 16 char long loop still require more comparisons, branches and, in general, instructions. You can verify it by running a single experiment adding perfnorm numbers.
See https://blogs.oracle.com/javamagazine/post/loop-unrolling for some reference.

Could you please explain a bit more about the paragraph above? I'm having trouble understanding why using unrolledFirstIndexOf could cause things to be unfair.

@normanmaurer
Copy link
Member

What is the state of this one ?

@jchrys
Copy link
Contributor Author

jchrys commented Mar 13, 2024

Let me work on this issue. I will revisit it.

Motivation:
Currently, `AsciiString#indexOf` method using naive
iteration algorithm.

Modification:
Utilize SWAR technique.

Result:
Faster `AsciiString#indexOf`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Applying SWAR Technique to AsciiString
5 participants