[DRAFT]Enhance Performance of AsciiString Methods #13534
base: 4.1
This PR is in a very early phase, and I've just implemented the algorithm. |
Well done @jchrys, just a few hints:
I will check the bit-trick part (the uppercase/lowercase ones) calmly, with pen and paper. |
Last but not least, the math seems sound (but I really need to double-check it), but I have no idea whether there is any reference for it (especially the ranged ones). Do you have any link to share? |
I feel like this could be simplified a lot by recognising that upper and lower case ASCII characters differ only by 0x20 (lower case characters intersect with 0x20, upper case letters do not). So to do a case insensitive search you compile a pattern for the character you're looking for (e.g. |
Oh, and it goes without saying that |
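The 0x20 observation above can be sketched for a single pair of bytes as follows. This is a minimal illustration, not the code from this PR; the class and method names (`AsciiCaseSketch`, `isAsciiLetter`, `equalsIgnoreCase`) are hypothetical. The letter check matters because non-letter pairs such as `[` (0x5B) and `{` (0x7B) also differ only by 0x20.

```java
final class AsciiCaseSketch {
    private AsciiCaseSketch() { }

    /** Returns true if b is an ASCII letter ('A'..'Z' or 'a'..'z'). */
    public static boolean isAsciiLetter(byte b) {
        // Setting bit 0x20 folds 'A'..'Z' onto 'a'..'z'; negative (non-ASCII)
        // bytes stay negative after sign extension and fail the range check.
        int folded = b | 0x20;
        return folded >= 'a' && folded <= 'z';
    }

    /** Case-insensitive equality for two ASCII bytes. */
    public static boolean equalsIgnoreCase(byte a, byte b) {
        if (a == b) {
            return true;
        }
        // The bytes may differ only in bit 0x20, and only if they are letters.
        return (a ^ b) == 0x20 && isAsciiLetter(a);
    }
}
```

For example, `equalsIgnoreCase((byte) 'a', (byte) 'A')` is true, while `equalsIgnoreCase((byte) '[', (byte) '{')` is false even though the two bytes differ only by 0x20.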
Thanks @richardstartin ! |
@franz1981 |
Hi @jchrys please check the suggestions from @richardstartin and the link I have shared; it's likely that we don't need that complexity for the lowercase/uppercase check and translation. |
@franz1981 |
Hello, @franz1981. I have the results to share.
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8 2023-07-18, Ubuntu 22.04.3 LTS (benchmark branch)
|
The folly impl is expected to be faster, so I'm happy that the numbers reflect it. I think the approach should be used for both (upper/lower case conversions and case-insensitive comparisons). |
Thank you for the detailed explanation! I will delve into this and analyze it. 😄 |
Hello, @franz1981. The following assembly code has been extracted from perfasm output. The accompanying comments might not be accurate (I added them for my own reference). (x86_64, 1x10x2, Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz, openjdk 17.0.8, Ubuntu 22.04.3 LTS)
0.12% 0x00007f748109d8f0: lea 0x10(%rsi,%rcx,1),%rdi
0.46% 0x00007f748109d8f5: mov %rdi,%rbx
1.43% 0x00007f748109d8f8: cmp $0x10,%edx ; compare length, 16 (%edx = length)
0x00007f748109d8fb: jl 0x00007f748109d97a ; jump if length is less than 16
2.10% 0x00007f748109d901: cmp $0x20,%edx ; compare length, 32
╭ 0x00007f748109d904: jl 0x00007f748109d94 ; jump if length is less than 32
0.98% │ 0x00007f748109d90a: vmovd %eax,%xmm0 ; move eax(charToFind) to %xmm0(128-bit register).
0.55% │ 0x00007f748109d90e: vpbroadcastb %xmm0,%ymm0 ; broadcast xmm0 to ymm0(256-bit register)
1.41% │ 0x00007f748109d913: vpxor %ymm1,%ymm1,%ymm1 ; zeroout ymm1
0.85% │ 0x00007f748109d917: mov %edx,%ecx ; copy edx(length) to ecx
0.94% │ 0x00007f748109d919: and $0xffffffe0,%ecx ; round length down to a multiple of 32 (ecx = bytes for the 256-bit loop)
0.45% │ 0x00007f748109d91c: and $0x1f,%edx ; remainder below 32 bytes (edx = length % 32)
1.60% │↗ 0x00007f748109d91f: vmovdqu (%rbx),%ymm2 ; load 32byte from underlying array to ymm2
4.20% ││ 0x00007f748109d923: vpcmpeqb %ymm0,%ymm2,%ymm2 ; compare each byte in ymm0 and ymm2 and store result in ymm2
11.46% ││ 0x00007f748109d927: vptest %ymm2,%ymm1 ; perform test
2.04% ││ 0x00007f748109d92c: jae 0x00007f748109d99b ; jump if found
3.48% ││ 0x00007f748109d932: add $0x20,%rbx ; advance rbx by 32
2.93% ││ 0x00007f748109d936: sub $0x20,%ecx ; subtract 32 from ecx (256-bit count)
│╰ 0x00007f748109d939: jne 0x00007f748109d91f ; jump if ecx is not equal to zero
0.05% │ ╭ 0x00007f748109d93b: jmp 0x00007f748109d94d ; unconditional jump
↘ │ 0x00007f748109d940: vmovd %eax,%xmm0 ; move eax(byteToFind) to xmm0(128bit)
│ 0x00007f748109d944: vpxor %xmm1,%xmm1,%xmm1 ; zero-out xmm1
│ 0x00007f748109d948: vpshufb %xmm1,%xmm0,%xmm0 ; broadcast the low byte of xmm0 to all 16 lanes (shuffle mask is all zeros)
0.07% ↘ 0x00007f748109d94d: cmp $0x10,%edx ; compare edx,16
0x00007f748109d950: jl 0x00007f748109d97a ; jump if less than 16
0x00007f748109d956: mov %edx,%ecx ; copy edx(length) to ecx
0x00007f748109d958: and $0xfffffff0,%ecx ; 16byte count
0x00007f748109d95b: and $0xf,%edx ; remainder below 16 bytes
0x00007f748109d95e: vmovdqu (%rbx),%xmm2 ; load 16bytes from rbx to xmm2
0x00007f748109d962: vpcmpeqb %xmm0,%xmm2,%xmm2 ; compare xmm0 and xmm2 and save result to xmm2
0x00007f748109d966: vptest %xmm2,%xmm1 ; test xmm2 and xmm1
0x00007f748109d96b: jae 0x00007f748109d99b ; jump if test result is found
0x00007f748109d971: add $0x10,%rbx ; advance rbx by 16
0x00007f748109d975: sub $0x10,%ecx ; subtract remaining by 16
0x00007f748109d978: jne 0x00007f748109d95e ; jump if remaining is not 0
0x00007f748109d97a: test %edx,%edx ; test if length is 0
0x00007f748109d97c: je 0x00007f748109d994 ; jump, if length is 0
0x00007f748109d982: movzbl (%rbx),%ecx ; load the byte at rbx, zero-extended, into ecx
0x00007f748109d985: cmp %ecx,%eax ; compare byteToFind (eax) with the current byte (ecx)
0x00007f748109d987: je 0x00007f748109d9a5 ; jump, if found
0x00007f748109d989: add $0x1,%rbx ; advance arrayIdx
0x00007f748109d98d: sub $0x1,%edx ; decrement length
0x00007f748109d990: je 0x00007f748109d994 ; jump, if (edx = length) is zero
0x00007f748109d992: jmp 0x00007f748109d982 ; unconditional jump
0x00007f748109d994: mov $0xffffffff,%ebx ; not found return -1
0x00007f748109d999: jmp 0x00007f748109d9a8 ;
0x00007f748109d99b: vpmovmskb %ymm2,%ecx ; dealing with vector result
0x00007f748109d99f: bsf %ecx,%eax ;
The chart below presents a comparison between
I believe that 16 could be a suitable candidate for the cutoff value. What are your thoughts on this? |
Excellent analysis @jchrys, well done! Regarding the cut-off: the idea is to check whether the cached String exists only if the size is less than 16, then use the right method. |
Hello, @franz1981 . Thanks a lot! 😃
Yes, that's correct. It uses a cache and doesn't create a new String. I cached it during the initialization phase. (Link)
public static Object blackhole;
@Setup(Level.Trial)
@SuppressJava6Requirement(reason = "using SplittableRandom to reliably produce data")
public void init() {
---snip ---
for (int i = 0; i < permutations; ++i) {
---snip---
data[i] = new AsciiString(byteArray);
blackhole = data[i].toString(); // cache
}
}
I will look into the topic of |
@plokhotnyuk Sure 😄 I will notify you once I have the results. (It might take some time) |
@plokhotnyuk hi! Why with multiple threads? Vectorization isn't involved on the Netty side (although we compare against it)... are you worried about some instability in the system? @jchrys remember to use numactl --localalloc -N 0 (or whatever node), the tuned latency profile, and to disable turbo boost while running it. |
Hi, @franz1981. Regarding the tuned latency profile, would the |
Yep @jchrys, it should be ok! The very last point is: in the best case for the SIMD version (e.g. size 64), the performance is not (at all) near the expected speedup compared to SWAR (e.g. 4 times). These last 2 points are not related to the PR but are just for scientific curiosity: I would expect the JDK to outperform the purely Java version by more than just 2X in the best case... unless the bottleneck is elsewhere (reducing the number of inputs/dataset to the bare minimum should help there as well, given that both are branchless and we can always use 64 as the target size for both). |
@franz1981 @jchrys Usually Netty-based services use several threads, so it is quite natural to run benchmarks using multiple threads. Here is an example of a comparison between 1 and 16 threads, so you can easily see that using vectorization, allocations or concurrent data structures reduces scalability across cores. |
Sorry @plokhotnyuk, let me gently disagree here: the usual execution context of Netty doesn't apply to very fine-grained (micro, really) benchmarks like these. Unless there are concerns related to the scalability of vectorized instructions, I don't see a point in using multiple threads for a type of benchmark which doesn't involve the full Netty pipeline stack, nor any form of sharing/concurrent data structures, nor allocation-intensive operations. Running the same benchmark with -prof gc will help to show the last point regarding allocations. |
Hello, @franz1981
Could you please explain a bit more about the paragraph above? I'm having trouble understanding why using |
What is the state of this one ? |
Let me work on this issue. I will revisit it. |
Motivation: Currently, the `AsciiString#indexOf` method uses a naive iteration algorithm. Modification: Utilize the SWAR technique. Result: Faster `AsciiString#indexOf`.
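As a rough illustration of what a SWAR (SIMD within a register) byte search looks like, here is a minimal sketch that scans 8 bytes per step. It is not the implementation from this PR: the class name `SwarIndexOfSketch` and its methods are hypothetical, and it reads words through a `ByteBuffer` purely to stay self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class SwarIndexOfSketch {
    private static final long LOW7 = 0x7F7F7F7F7F7F7F7FL;

    private SwarIndexOfSketch() { }

    /** Repeats the given byte in all 8 lanes of a long. */
    public static long broadcast(byte b) {
        return (b & 0xFFL) * 0x0101010101010101L;
    }

    /** Sets bit 7 of every lane whose byte equals the pattern byte. */
    public static long matchMask(long word, long pattern) {
        long x = word ^ pattern;      // matching lanes become 0x00
        long t = (x & LOW7) + LOW7;   // bit 7 set if any low 7 bits set; never carries across lanes
        return ~(t | x | LOW7);       // bit 7 survives only where the whole lane was zero
    }

    /** Returns the first index of value in data, or -1 if absent. */
    public static int indexOf(byte[] data, byte value) {
        // Little-endian, so lane 0 of each word is the lowest array index.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long pattern = broadcast(value);
        int i = 0;
        for (; i + 8 <= data.length; i += 8) {
            long mask = matchMask(buf.getLong(i), pattern);
            if (mask != 0) {
                // The lowest set bit sits at 8 * lane + 7.
                return i + (Long.numberOfTrailingZeros(mask) >>> 3);
            }
        }
        for (; i < data.length; i++) {  // scalar tail for the last < 8 bytes
            if (data[i] == value) {
                return i;
            }
        }
        return -1;
    }
}
```

The match test is exact (no false positives), so the first set bit of the mask directly identifies the first matching byte.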
Motivation:
`AsciiString` is a specialized class designed for ASCII character handling. However, it currently lacks specialized algorithms, which hinders its performance. By implementing SWAR and other techniques, we can significantly improve its performance.
Modification:
Implemented branchless bulk case conversion for 8 characters at a time.
Optimized checks for the existence of lowercase or uppercase characters.
Utilized SWAR to efficiently find a given byte.
Result:
Improved performance.
Resolves #13522
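The branchless case conversion and the upper-case check listed under "Modification" can be sketched for one 8-byte word as below. This is an illustrative sketch under the assumption that the data is processed one `long` at a time; the names (`SwarCaseSketch`, `upperCaseMask`, `containsUpperCase`, `toLowerCase`) are hypothetical and not taken from the PR.

```java
final class SwarCaseSketch {
    private static final long LOW7 = 0x7F7F7F7F7F7F7F7FL;
    private static final long HIGH = 0x8080808080808080L;

    private SwarCaseSketch() { }

    /** Returns 0x80 in every lane that holds 'A'..'Z', and 0x00 elsewhere. */
    public static long upperCaseMask(long word) {
        // Adding 0x25 maps 'A'..'Z' (0x41..0x5A) to 0x66..0x7F; larger
        // 7-bit values overflow into bit 7 and are masked back to 0x00..0x24.
        long t = (word & LOW7) + 0x2525252525252525L;
        t &= LOW7;
        // Adding 0x1A pushes only 0x66..0x7F up to bit 7.
        t += 0x1A1A1A1A1A1A1A1AL;
        // Discard lanes whose original byte was non-ASCII (bit 7 set).
        return t & ~word & HIGH;
    }

    /** True if any of the 8 bytes is an upper-case ASCII letter. */
    public static boolean containsUpperCase(long word) {
        return upperCaseMask(word) != 0;
    }

    /** Lower-cases all 8 bytes at once, without branches. */
    public static long toLowerCase(long word) {
        // Shifting 0x80 right by two yields 0x20, the ASCII case bit.
        return word | (upperCaseMask(word) >>> 2);
    }
}
```

Because every operation stays within its own 8-bit lane, the result does not depend on the byte order used to load the word. The upper-case direction would follow the same pattern with different constants and a bit-clear in place of the OR.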