
Optimize the concurrent performance of Cpp target by more than 10 times #4237

Merged
merged 10 commits into antlr:dev on May 6, 2023

Conversation

wangtao9
Contributor

@wangtao9 wangtao9 commented Apr 18, 2023

Usage:
add the -lock-free-cpp-target option when generating the parser, e.g.
java -jar ${ANTLR_JAR} -Dlanguage=Cpp -lock-free-cpp-target Cypher.g4
add the compile option -DANTLR4_USE_THREAD_LOCAL_CACHE=1 when compiling the generated C++ lexer & parser.

Related issues:
#2454
#2584
#3938
Why the C++ target is 6X slower than the Java target

Optimization result:
[image: benchmark results before/after the optimization]

Test configuration:
Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz ; Cores: 16 ; Logical processors: 32
256GB memory
grammar file: https://s3.amazonaws.com/artifacts.opencypher.org/M21/Cypher.g4
test query:

    MATCH (n:Person {id:2000})-[:knows]->(friend)-[:located_in]->(city)
    RETURN friend, count(city) ORDER BY friend.id LIMIT 100

Signed-off-by: wangtao9 <wangtaofighting@163.com>

Signed-off-by: wangtao9 <wangtaofighting@163.com>
@KvanTTT
Member

KvanTTT commented Apr 18, 2023

This optimization looks great.

@jimidle
Collaborator

jimidle commented Apr 18, 2023

I have done a similar build config for go. It does make a difference. I didn’t get 10x from go, but there could be all sorts of reasons for that.

Is your input example all you tried? Your mileage may vary on different input. I’ll try the same thing with go as well.

It’s 22:45 where I live, so I’ll try tomorrow

@wangtao9
Contributor Author

I have done a similar build config for go. It does make a difference. I didn’t get 10x from go, but there could be all sorts of reasons for that.

Is your input example all you tried? Your mileage may vary on different input. I’ll try the same thing with go as well.

It’s 22:45 where I live, so I’ll try tomorrow

I've also tried simpler inputs such as "RETURN 1" and more complex examples of over 800 characters, both with significant improvements.

But more importantly, this optimization achieves a huge performance improvement for a mechanical reason: the C++ runtime handles locks differently from the JVM and falls into kernel calls more frequently, which is one of the main reasons the concurrent performance of the C++ target is much slower than that of the Java target.

@jimidle
Collaborator

jimidle commented Apr 19, 2023 via email

@wangtao9
Contributor Author

@parrt Can this PR be merged?

tool/src/org/antlr/v4/tool/Grammar.java (review comment, resolved)
tool/src/org/antlr/v4/codegen/model/Recognizer.java (review comment, resolved)
@parrt
Member

parrt commented Apr 24, 2023

Perhaps @hzeller has an opinion here, but frankly I'm terrified of a multi-threaded version of the parsing strategy... It is incredibly tricky to get right, and many people rely on the C++ runtime.

@parrt
Member

parrt commented Apr 24, 2023

OK, I just looked at the code. You are simply removing a lock, am I correct? I guess the question is how it works without the lock in a multithreaded environment?

@hzeller
Contributor

hzeller commented Apr 24, 2023

I had a brief look: it replaces one global variable with multiple thread_local ones. I don't know the exact call sequence of the C++ runtime, so I don't know what happens then. It seems like initialize() is now called unconditionally every time instead of once, which would be strange, but I first have to look at exactly what that means in the generated code. Maybe this evening.

Personally, I would anyway avoid doing something like

static Foo *foo = nullptr;
static std::once_flag fooOnce;
std::call_once(fooOnce, initialize_foo);

But more something like

  static Foo *foo = new Foo();

Then the memory model takes care of initializing that static field exactly once, possibly more efficiently than any call_once implementation, which was more of a workaround needed before the C++11 clarification of the memory model.

So I would create functions that create the static object and return the pointer, and then call them in this pattern:

  static StaticData *LexerStaticData = CreateLexerStaticData();  // or whatever that template expands to.

I'd do that unconditionally: don't add a lockFreeCppTarget define, always do it this way. We can then also remove the code that provides call_once(), as it is only needed there.

I know @jcking was looking at multi-threaded performance; maybe he has come across this part of the code and has some recollection of if/why call_once() was needed? I suspect it was some pre-C++11 requirement.
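
The static-initialization pattern suggested above can be sketched as follows (a minimal self-contained illustration; the struct and function names are hypothetical, not the actual generated code):

```cpp
#include <atomic>
#include <cassert>

struct StaticData {
  int atnStateCount;
};

// Counts how many times the factory actually runs.
static std::atomic<int> constructions{0};

static StaticData *CreateLexerStaticData() {
  constructions.fetch_add(1);
  return new StaticData{42};
}

// Since C++11, initialization of a function-local static is guaranteed
// to run exactly once, even with concurrent callers, so no call_once
// machinery is needed.
StaticData *LexerStaticData() {
  static StaticData *data = CreateLexerStaticData();
  return data;
}
```

Every caller gets the same pointer, and the factory runs exactly once, with the compiler emitting the necessary synchronization.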

@hzeller
Contributor

hzeller commented Apr 24, 2023

@jcking changed the once implementation to be either a local one or an absl one in this change, but that did not change the need for call_once() per se. I suspect he did that because things are faster with absl.

With the suggestions in my previous comment we can eliminate the pre-C++11 need for call_once() entirely and just do static initialization.

@wangtao9
Contributor Author

OK, I just looked at the code. You are simply removing a lock, am I correct? I guess the question is how it works without the lock in a multithreaded environment?

@parrt Not exactly. It does not simply remove a lock; it changes the static data shared across threads into one copy per thread. Since the data is owned by each thread, locks are no longer relied upon to keep the data safe (although the locks still exist).

@wangtao9
Contributor Author

@hzeller
"eliminate the pre-c++11 need to call_once()" is a good thing, but it's not directly related to what this optimization does.

The idea of this optimization is to turn the static data shared by multi-threads into thread-owned, so as to avoid competition for locks.

<lexer.name>::initialize()/<parser.name>::initialize() is actually called only once in the constructor (once per thread), so this optimization does not require call_once() or the equivalent C++11 semantics.
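
The once-per-thread behaviour described above can be sketched like this (types and names hypothetical, not the actual runtime code):

```cpp
#include <cassert>
#include <thread>

struct StaticData {
  int dfaStates = 0;
};

// One cache pointer per thread: no lock is needed, because no other
// thread can ever observe this pointer.
static thread_local StaticData *lexerStaticData = nullptr;

// Called from the recognizer constructor. The null check makes it
// idempotent, so the cache is built at most once per thread.
StaticData *initialize() {
  if (lexerStaticData == nullptr) {
    lexerStaticData = new StaticData{};
  }
  return lexerStaticData;
}
```

Repeated calls on one thread return the same pointer, while a second thread builds and sees its own copy.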

@hzeller
Contributor

hzeller commented Apr 25, 2023

Is the data structure modified in each thread ?

If not, we don't need locks, and can make the data structure const to make sure nobody does that in the future.

But if so, then it being static sounds like a bad idea. Thread-local storage will fix that particular situation so locks are not required, but it also means something else is going on, and changing it to thread_local will change the semantics: now every thread sees different content.

@wangtao9
Contributor Author

@hzeller
The static data (mainly the DFA, ATN, etc.) is constructed during the parse process and is not modified after construction completes. Most importantly, the data structures constructed by multiple threads and by a single thread are completely consistent.

I also verified this experimentally. As shown in the figure below, the left side is the log of building DFA states with a single thread, and the right side is 4 threads doing the same thing. Except for the different thread ids, the constructed DFA states are exactly the same.
[image: DFA construction logs, single-threaded vs. 4 threads]

@ericvergnaud
Contributor

So basically you're sacrificing reuse to avoid locks? I suspect this might be counterproductive in terms of performance with complex grammars, because each thread needs to rebuild the complete DFA instead of it being built just once. Is it possible that the locks that currently protect concurrent updates of the DFA are protecting too much code?

Signed-off-by: wangtao9 <wangtaofighting@163.com>
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@wangtao9
Contributor Author

@ericvergnaud I think what you say makes sense; it could happen.
However, in my usage scenario (ultra-high concurrent requests), I care more about the scalability of multi-core performance, and this optimization is necessary there. In the scenarios I tested, it is effective.
To be on the safe side, I added an option -lock-free-cpp-target and turned this optimization off by default.

@hzeller
Contributor

hzeller commented Apr 25, 2023

So if the state is constructed once and then never modified, the object can be const, right? If so, there is no need for any locks (and thus no need for thread_local).
The fact that this apparently makes a difference suggests that the state is indeed modified in the various threads, so locks are needed?

@ericvergnaud
Contributor

ericvergnaud commented Apr 25, 2023 via email

Member

@KvanTTT KvanTTT left a comment


I thought about it once again, and now I don't think introducing the new option is a good idea. Why can't this option be activated when compiling the runtime instead? That is a more flexible solution since it doesn't require regeneration. At least in C++ it's possible to use preprocessor directives:

#if USE_THREAD_LOCAL_CACHE
static thread_local
#endif
<lexer.name; format = "cap">StaticData *<lexer.grammarName; format = "lower">LexerStaticData = nullptr;
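
For a grammar named Cypher, that template would expand to roughly the following (a hypothetical expansion for illustration, not the actual generated file):

```cpp
#include <cassert>

// Hypothetical static-data type for a lexer generated from Cypher.g4.
struct CypherLexerStaticData {
  int serializedATN = 0;
};

// Default the macro off so the fragment compiles either way.
#ifndef ANTLR4_USE_THREAD_LOCAL_CACHE
#define ANTLR4_USE_THREAD_LOCAL_CACHE 0
#endif

// The preprocessor selects per-thread or shared storage at compile time,
// with no regeneration of the parser required.
#if ANTLR4_USE_THREAD_LOCAL_CACHE
static thread_local
#else
static
#endif
CypherLexerStaticData *cypherlexerLexerStaticData = nullptr;
```

Flipping the macro at compile time changes only the storage duration of the cache pointer; the rest of the generated code is unchanged.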

@jimidle
Collaborator

jimidle commented Apr 26, 2023 via email

@wangtao9
Contributor Author

wangtao9 commented Apr 26, 2023

@hzeller In C++, these data cannot simply be declared const, because we have no way to determine their values at declaration time; they are assigned during the parse process.

But I think there is a way (with major changes) to completely remove the use of locks. That would be a big upgrade, and similar optimization effect could be achieved.

@wangtao9
Contributor Author

wangtao9 commented Apr 26, 2023

@KvanTTT I also support this modification: changing the generation-time option to a compile-time one.
I will make it happen.

Updated:
I have made this modification and named the option ANTLR4_USE_THREAD_LOCAL_CACHE to avoid C++ macro conflicts.

@wangtao9
Contributor Author

@jimidle What is the multithreading speedup you measured in Go?

I suppose the scalability should be much better than the C++ target's (where the speedup of 32 threads over a single thread is only 1.34).

@jimidle
Collaborator

jimidle commented Apr 26, 2023 via email

Signed-off-by: wangtao9 <wangtaofighting@163.com>
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@mike-lischke
Member

Not sharing the DFA means each thread has to do the warmup phase on its own. What exactly does this approach improve? Keep in mind that the locks are only used while building up the DFA. It doesn't affect normal runtime behavior once this is done (well, actually, if new input comes up that wasn't parsed before there can still be modifications, but this has a very low overhead, compared to the initial warmup). Additionally, making the DFA thread local can have a serious memory impact.

To me the attempts to improve that part of the code seem like effort spent in the wrong spot. It's not the (very few) locks there, which are only used while building up the DFA, and whose impact could be lowered by running a parser in a single thread and priming it with typical input until the DFA no longer grows significantly, only then starting the other threads. Instead, work should be spent on the adaptivePredict part, which has a much higher overall impact: first convert the recursion to iteration to avoid deep call stacks (which can become a serious problem in threads, where you cannot set a bigger stack size), and then remove the need for std::shared_ptr. Manage the pointers in a different way and work only with raw pointers in this hot path. Maybe a shared pointer class of our own would be useful here, one that does not have to be thread safe and hence doesn't use locks?
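
The last suggestion, a shared pointer that skips atomic reference counting for single-threaded hot paths, could look roughly like this (a hypothetical sketch, not a proposal for the actual runtime):

```cpp
#include <cassert>
#include <cstddef>

// A minimal non-thread-safe shared pointer: the refcount is a plain
// size_t, so copies avoid the atomic operations std::shared_ptr pays
// for. Safe only when all copies live on one thread.
template <typename T>
class LocalSharedPtr {
 public:
  explicit LocalSharedPtr(T *p) : ptr_(p), count_(new std::size_t(1)) {}
  LocalSharedPtr(const LocalSharedPtr &o) : ptr_(o.ptr_), count_(o.count_) {
    ++*count_;
  }
  LocalSharedPtr &operator=(const LocalSharedPtr &o) {
    if (this != &o) {
      release();
      ptr_ = o.ptr_;
      count_ = o.count_;
      ++*count_;
    }
    return *this;
  }
  ~LocalSharedPtr() { release(); }

  T *get() const { return ptr_; }
  std::size_t use_count() const { return *count_; }

 private:
  void release() {
    if (--*count_ == 0) {
      delete ptr_;
      delete count_;
    }
  }
  T *ptr_;
  std::size_t *count_;
};
```

The tradeoff is exactly the one named above: plain increments instead of atomic ones, at the cost of giving up cross-thread sharing.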

@mike-lischke
Member

One thing that would be a potential problem on C++ but not on Java/C# is the latter allows us to compute a value multiple times and then quietly discard all but one of them. The C++ target would be responsible for determining which one among multiple values actually ended up in the cache, and releasing the memory from all the remaining ones.

@sharwell, do you refer to the fact that sometimes references are replaced (e.g. during optimization/merge runs in PredictionContext)? I think this is handled properly by the shared pointers (and is one of the main reasons why shared pointers are used at all).

@parrt
Member

parrt commented May 1, 2023

@wangtao9 Are you measuring DFA warm-up time here, or throughput after the parser is warmed up? Or both together? I would bet that after warm-up the existing system gets higher throughput with multiple threads.

@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

@mike-lischke

Keep in mind that the locks are only used while building up the DFA. It doesn't affect normal runtime behavior once this is done

Most of what you say is correct, but I think this key point is not quite right. The locks are not "only used while building up the DFA": after the build is done, the read lock is still taken for EVERY read, resulting in poor concurrent performance of the C++ target (the speedup ratio with 32 threads is less than 2).
That's why I submitted this PR.

@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

@parrt
I measured one warmup plus many parses. Although I did not measure it separately, warmup should be time-consuming. But warmup is a one-time job (whether all threads share one warmup, before this optimization, or each thread does its own, after this optimization); as @mike-lischke mentioned, the subsequent incremental warmup "has a very low overhead, compared to the initial warmup".

@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

So if there is no correctness issue, then this LGTM.

The choice comes with more memory consumption and redundant computation, but that is a tradeoff the user can make. And for the single-threaded case, there is no disadvantage.

@hzeller Totally agree, exactly what I meant.

Signed-off-by: wangtao9 <wangtaofighting@163.com>
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

Also, please add information about the new directive to the doc (cpp-target.md).

@KvanTTT Done.

@mike-lischke
Member

mike-lischke commented May 4, 2023

@wangtao9 OK, thanks for taking the care to create such a patch after analysing the code thoroughly. I also believe the C++ runtime could benefit very much from removing locks (including shared_ptr). So I'm fine with your PR.

Member

@mike-lischke mike-lischke left a comment


Just fix that little typo.

doc/cpp-target.md (review comment, resolved)
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

@mike-lischke Thanks for your review. Like you, I also think removing locks is worth doing; maybe it can be put on the agenda in the near future? Until that big change is done, C++ runtime users can get similar benefits from this optimization :D

Member

@KvanTTT KvanTTT left a comment


Now it's OK for me, but please also fix the minor issues in cpp-target.md.

doc/cpp-target.md (review comment, resolved)
Co-authored-by: Ivan Kochurkin <kvanttt@gmail.com>
Signed-off-by: Tao Wang <wangtaofighting@163.com>
Member

@KvanTTT KvanTTT left a comment


Thanks a lot!

@wangtao9
Contributor Author

wangtao9 commented May 6, 2023

Glad to contribute to antlr4 project! :D

@parrt parrt added this to the 4.12.1 milestone May 6, 2023
@parrt parrt merged commit aed321c into antlr:dev May 6, 2023
45 checks passed
@parrt
Member

parrt commented May 6, 2023

Thanks, everyone, especially @wangtao9 !

@taodongl

@wangtao9 @parrt

static thread_local JSONLexerStaticData *jsonlexerLexerStaticData = nullptr;

Does it lead to a memory leak? Who is responsible for releasing the memory when a thread is destroyed?
A similar post about thread_local: https://stackoverflow.com/questions/46429861/c-how-to-use-thread-local-to-declare-a-pointer-variable

@wangtao9
Contributor Author

@taodongl
Great that you found it! There is indeed a risk of memory leaks when threads exit prematurely. I actually fixed this a few weeks ago; see commit wangtao9@ce6649a. I'll submit a new PR later, @parrt.

I solved the problem with a thread_local std::unique_ptr; the unique_ptr is responsible for managing the memory.

No memory leaks are detected now; you can verify it with this link: https://github.com/wangtao9/antlr4-perfopt-test/tree/sanitizer_check
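
The fix described above can be sketched as follows (names hypothetical; see the linked commit for the real change):

```cpp
#include <cassert>
#include <memory>

struct StaticData {
  int atnStates = 7;
};

// A thread_local unique_ptr is destroyed by the C++ runtime when its
// owning thread exits, so early thread termination no longer leaks
// the per-thread cache.
static thread_local std::unique_ptr<StaticData> lexerStaticData;

StaticData *initialize() {
  if (!lexerStaticData) {
    lexerStaticData = std::make_unique<StaticData>();
  }
  return lexerStaticData.get();
}
```

The raw pointer returned to callers stays valid for the thread's lifetime, while ownership (and cleanup) rests with the unique_ptr.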
