Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster query string decoder #13473

Open
wants to merge 2 commits into
base: 4.1
Choose a base branch
from

Conversation

franz1981
Copy link
Contributor

Motivation:

decodeParams is not accounting for common decoding scenarios
making it highly unpredictable.

Modifications:

Arrange control flow to isolated predictable from unpredictable
checks to improve CPU usage (stalled frontend cycles)

Result:

Faster query decoding

@franz1981
Copy link
Contributor Author

@vietj too

@franz1981
Copy link
Contributor Author

franz1981 commented Jul 7, 2023

These are the numbers with these changes:

Benchmark                                                Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecoding               thrpt   40   3.278 ± 0.043  ops/us
QueryStringDecoderBenchmark.noDecoding                  thrpt   40  12.922 ± 0.041  ops/us
QueryStringDecoderBenchmark.nonStandardDecoding         thrpt   40   4.339 ± 0.014  ops/us
QueryStringDecoderBenchmark.onlyDecoding                thrpt   40   3.804 ± 0.015  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameter  thrpt   40  23.210 ± 0.404  ops/us

compared to before:

Benchmark                                                Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecoding               thrpt   40   2.641 ± 0.170  ops/us
QueryStringDecoderBenchmark.noDecoding                  thrpt   40  11.413 ± 0.112  ops/us
QueryStringDecoderBenchmark.nonStandardDecoding         thrpt   40   3.219 ± 0.058  ops/us
QueryStringDecoderBenchmark.onlyDecoding                thrpt   40   3.239 ± 0.114  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameter  thrpt   40  22.001 ± 0.404  ops/us

This is with

VM version: JDK 17.0.7, Java HotSpot(TM) 64-Bit Server VM, 17.0.7+8-LTS-224

and

AMD Ryzen™ 9 7950X × 32@4.5 GHz

given that the most costy operation there is the component decoding, the improvements are ~5-30%: not bad, but in some next PR I can improve the rest too.

@franz1981 franz1981 force-pushed the batch_query_decoder branch 2 times, most recently from 63da020 to cd1100d Compare July 7, 2023 13:42
@franz1981
Copy link
Contributor Author

franz1981 commented Jul 7, 2023

@viet I've added a new inteface (bad name, to improve), which could save users which have to perform
single shot decoding, to collect (and likely copy too) the decoded parameters in the provided linked map (IIRC Vertx prefer a MultiMap).

wdyt?

@chrisvest I'm opened to ideas to provide a decent API here: impl wise I've preferred to do some "dirty trick"
with parameters/return types to avoid copying the core algorithm twice.

Perf-wise these is what we get with the new API:

Benchmark                                                     Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecodingNoMap               thrpt   40   3.722 ± 0.070  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                  thrpt   40  20.763 ± 0.040  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap         thrpt   40   5.231 ± 0.017  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                thrpt   40   4.157 ± 0.019  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap  thrpt   40  38.208 ± 0.855  ops/us

Which remove the map allocation from the equation (that's unfair: users will likely have their own bespoke map there!)

@franz1981
Copy link
Contributor Author

@chrisvest when the API and the PR is stable I'll write/modify the existing tests to use it.

@franz1981
Copy link
Contributor Author

I Will perform some assembly analysis because some of the numbers are really weird and I suspect there is something going on inlining side.

@franz1981
Copy link
Contributor Author

franz1981 commented Jul 8, 2023

The same results on my Intel machine (Intel® Core™ i7-9850H × 12@4.4 Ghz instead are (on JDK 17):

This pr:

Benchmark                                                Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecoding               thrpt   20   1.517 ± 0.036  ops/us
QueryStringDecoderBenchmark.noDecoding                  thrpt   20   7.212 ± 0.095  ops/us
QueryStringDecoderBenchmark.nonStandardDecoding         thrpt   20   2.055 ± 0.020  ops/us
QueryStringDecoderBenchmark.onlyDecoding                thrpt   20   1.918 ± 0.031  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameter  thrpt   20  13.406 ± 0.336  ops/us

4.1:

Benchmark                                                Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecoding               thrpt   20   1.400 ± 0.018  ops/us
QueryStringDecoderBenchmark.noDecoding                  thrpt   20   6.143 ± 0.054  ops/us
QueryStringDecoderBenchmark.nonStandardDecoding         thrpt   20   1.677 ± 0.027  ops/us
QueryStringDecoderBenchmark.onlyDecoding                thrpt   20   1.837 ± 0.023  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameter  thrpt   20  11.653 ± 0.138  ops/us

that are similar.
With JDK 8 instead, the cost is more biased around decoding name/value parameter and the importance of parsing is lessen (meaning that their contribution on the performance improvement is slower).
I will perform some more deeper analysis here, although the existing improvement is already decent IMO

@franz1981 franz1981 force-pushed the batch_query_decoder branch 5 times, most recently from 3e6605c to a821f4e Compare July 9, 2023 14:02
@franz1981
Copy link
Contributor Author

@chrisvest I didn't overloaded the static methods to perform parameter decoding yet: do you think it to be a nice addition?
I've supposed it wasn't a widely adopted feature, really, but useful, so didn't wanted to make it "heavy" by adding many different methods.

Test-wise, I've added an additional assertParameters sanity check for all existing tests which perform some form of parameter validation, but it is using the existing instance method only.
I can add calling the other static ones, which perform some null checks,let me know your opinion here

@franz1981
Copy link
Contributor Author

I have another improvement coming from the decodeComponent part, hold on!

@vietj
Copy link
Contributor

vietj commented Jul 20, 2023

It would be great if we could have a static way of populating existing data structures, like

  public interface Populator<T> {

    void setQueryParam(String name, String value, T map);

  }

  public class MapPopulator implements Populator<Map<String, String>> {
    @Override
    public void setQueryParam(String name, String value, Map<String, String> map) {
      map.put(name, value);
    }
  }

@franz1981
Copy link
Contributor Author

thanks @vietj for the feedback, it's a great idea indeed!

@franz1981
Copy link
Contributor Author

After reading the different versions of the assembly produced by the benchmark I believe is it worthy to analyze what happen if the input sequence makes the JIT to decide for a different layout and order of the evaluated conditions (for the decoder state machine).

Hence, I've added a parameter to "confuse" JIT, and these are the numbers...
This PR:

Benchmark                                                       (confuseJIT)   Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecodingNoMap                         false  thrpt   40   3.531 ± 0.081  ops/us
QueryStringDecoderBenchmark.mixedDecodingNoMap                          true  thrpt   40   3.393 ± 0.043  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                            false  thrpt   40  21.158 ± 0.311  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                             true  thrpt   40  18.403 ± 0.460  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap                   false  thrpt   40   5.163 ± 0.302  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap                    true  thrpt   40   4.104 ± 0.101  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                          false  thrpt   40   4.192 ± 0.007  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                           true  thrpt   40   3.775 ± 0.068  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap            false  thrpt   40  37.600 ± 0.190  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap             true  thrpt   40  30.313 ± 0.167  ops/us
QueryStringDecoderBenchmark.singleResourceThreeParametersNoMap         false  thrpt   40  14.004 ± 0.106  ops/us
QueryStringDecoderBenchmark.singleResourceThreeParametersNoMap          true  thrpt   40  12.960 ± 0.265  ops/us

4.1 but changed to work with the *NoMap benchmarks:

Benchmark                                                       (confuseJIT)   Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecodingNoMap                         false  thrpt   40   3.242 ± 0.096  ops/us
QueryStringDecoderBenchmark.mixedDecodingNoMap                          true  thrpt   40   2.543 ± 0.108  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                            false  thrpt   40  15.523 ± 0.135  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                             true  thrpt   40  17.761 ± 0.951  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap                   false  thrpt   40   4.318 ± 0.063  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap                    true  thrpt   40   3.530 ± 0.070  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                          false  thrpt   40   3.219 ± 0.312  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                           true  thrpt   40   2.810 ± 0.090  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap            false  thrpt   40  38.327 ± 0.335  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap             true  thrpt   40  26.263 ± 1.522  ops/us
QueryStringDecoderBenchmark.singleResourceThreeParametersNoMap         false  thrpt   40   9.730 ± 0.045  ops/us
QueryStringDecoderBenchmark.singleResourceThreeParametersNoMap          true  thrpt   40  10.295 ± 0.674  ops/us

In short: the improvement is almost there, in many/most of the cases.
What I've noticed still by inspecting the assembly is that the main decoding method requires some spilling to happen
which could be saved, but the code is already very complex as it is, so I won't go down that rabbit hole.

I'll now impl the changes suggested by @vietj and we're ready to go.

@franz1981
Copy link
Contributor Author

303b2b2f2c9dacafb1044e0dd7a25e34a0fa0e3b has delivered another interesting speedup in almost all cases but nonStandardDecodingNoMap without confusingJIT

Benchmark                                                       (confuseJIT)   Mode  Cnt   Score   Error   Units
QueryStringDecoderBenchmark.mixedDecodingNoMap                         false  thrpt   40   3.408 ± 0.116  ops/us
QueryStringDecoderBenchmark.mixedDecodingNoMap                          true  thrpt   40   3.685 ± 0.114  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                            false  thrpt   40  21.339 ± 0.236  ops/us
QueryStringDecoderBenchmark.noDecodingNoMap                             true  thrpt   40  21.085 ± 0.678  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap                   false  thrpt   40   4.856 ± 0.178  ops/us
QueryStringDecoderBenchmark.nonStandardDecodingNoMap                    true  thrpt   40   4.667 ± 0.041  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                          false  thrpt   40   4.084 ± 0.065  ops/us
QueryStringDecoderBenchmark.onlyDecodingNoMap                           true  thrpt   40   3.959 ± 0.141  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap            false  thrpt   40  38.688 ± 0.342  ops/us
QueryStringDecoderBenchmark.singleResourceOneParameterNoMap             true  thrpt   40  35.615 ± 0.295  ops/us
QueryStringDecoderBenchmark.singleResourceThreeParametersNoMap         false  thrpt   40  14.442 ± 0.163  ops/us
QueryStringDecoderBenchmark.singleResourceThreeParametersNoMap          true  thrpt   40  13.737 ± 0.512  ops/us

this seems worthy especially in case JIT branch profiling is polluted

@franz1981
Copy link
Contributor Author

@vietj @chrisvest the only downside of unifying the map/callback (with context) approaches is that it always allocate an empty linked map in case no parameters are found, while on 4.1 it just return an empty (singleton) one.
Any suggestion is appreciated to restore the previous behaviour and/or decide that is not worthy...

@vietj
Copy link
Contributor

vietj commented Aug 21, 2023

@franz1981 can you lazy create the collector and add a new method to create the collector on the parameter collector ?

interface ParameterCollector<C> {
  C newCollector();
  ...
}

Otherwise use similar to Map.compute(K key, BiFunction<? super K, ? super V, ? extends V> remappingFunction) where you pass null first time and the collector is responsible to create the instance and return it to subsquently use it

@@ -191,16 +187,61 @@ public String path() {
return path;
}

@UnstableApi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can remove this annotation as its package-private anyway

/**
* Decodes {@link #uri()} parameters, reporting them in the provided {@code collector}'s map.
*/
public void decodeParameters(Map<String, List<String>> parameters) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think renaming the param makes things more clear

Suggested change
public void decodeParameters(Map<String, List<String>> parameters) {
public void decodeParameters(Map<String, List<String>> out) {

* and just store/report/filter them assuming random ordering, instead.
*/
public interface ParameterCollector<C> {
void accept(String name, String value, C accumulator);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void accept(String name, String value, C accumulator);
void collect(String name, String value, C accumulator);

}

private static int getFirstEscaped(String s, int from, int toExcluded, boolean isPath) {
int cutOff = !isPath? '+' : '%';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this easier to read

Suggested change
int cutOff = !isPath? '+' : '%';
int cutOff = isPath? '%' : '+' ;

public void accept(String name, String value, Map<String, List<String>> accumulator) {
List<String> values = accumulator.get(name);
if (values == null) {
values = new ArrayList<String>(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is 1 a good default ? Maybe use 4 so we dont need to expand directly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've no idea - I would expect that query params usually have a single value for each param, most of the time, rather then more, but I have not much exp in it

return -1;
}

private static int skipIf(String s, int from, int to, int ch) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static int skipIf(String s, int from, int to, int ch) {
private static int skipIf(CharSequence s, int from, int to, int ch) {

}

private static boolean addParam(String s, int nameStart, int valueStart, int valueEnd,
Map<String, List<String>> params, Charset charset) {
private static int indexOf(String s, int from, int to, int ch) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static int indexOf(String s, int from, int to, int ch) {
private static int indexOf(CharSequence s, int from, int to, int ch) {

return strBuf.toString();
}

private static int findPathEndIndex(String uri) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static int findPathEndIndex(String uri) {
private static int findPathEndIndex(CharSequence uri) {

Motivation:

decodeParams is not accounting for common decoding scenarios
making it highly unpredictable.

Modifications:

Arrange control flow to isolated predictable from unpredictable
checks to improve CPU usage (stalled frontend cycles)

Result:

Faster query decoding
@normanmaurer
Copy link
Member

normanmaurer commented Feb 21, 2024

@franz1981 what is the status of this PR ?

@franz1981
Copy link
Contributor Author

Sorry @normanmaurer I have parked this due to others which required more attention; I will change it again to unify some of the existing code paths

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants