
Optimize the concurrent performance of Cpp target by more than 10 times #4237

Merged
merged 10 commits into antlr:dev on May 6, 2023

Conversation

wangtao9
Contributor

@wangtao9 wangtao9 commented Apr 18, 2023

Usage:
add the -lock-free-cpp-target option when generating the parser, e.g.
java -jar ${ANTLR_JAR} -Dlanguage=Cpp -lock-free-cpp-target Cypher.g4
add the compile option -DANTLR4_USE_THREAD_LOCAL_CACHE=1 when compiling the generated C++ lexer & parser.

Related issues:
#2454
#2584
#3938
Why the C++ target is 6X slower than the Java target

Optimization result:
[image: benchmark results before/after the optimization]

Test configuration:
Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz ; Cores: 16 ; Logical processors: 32
256GB memory
grammar file: https://s3.amazonaws.com/artifacts.opencypher.org/M21/Cypher.g4
test query:

    MATCH (n:Person {id:2000})-[:knows]->(friend)-[:located_in]->(city)
    RETURN friend, count(city) ORDER BY friend.id LIMIT 100

Signed-off-by: wangtao9 <wangtaofighting@163.com>

Signed-off-by: wangtao9 <wangtaofighting@163.com>
@KvanTTT
Member

KvanTTT commented Apr 18, 2023

This optimization looks great.

@jimidle
Collaborator

jimidle commented Apr 18, 2023

I have done a similar build config for go. It does make a difference. I didn’t get 10x from go, but there could be all sorts of reasons for that.

Is your input example all you tried? Your mileage may vary on different input. I’ll try the same thing with go as well.

It’s 22:45 where I live, so I’ll try tomorrow

@wangtao9
Contributor Author

I have done a similar build config for go. It does make a difference. I didn’t get 10x from go, but there could be all sorts of reasons for that.

Is your input example all you tried? Your mileage may vary on different input. I’ll try the same thing with go as well.

It’s 22:45 where I live, so I’ll try tomorrow

I've also tried simpler inputs such as "RETURN 1" and more complex examples of over 800 characters, both with significant improvements.

But more importantly, this optimization achieves a huge performance improvement for a mechanical reason: the C++ runtime handles locks differently from the JVM and falls into kernel calls more frequently, which is one of the main reasons the concurrent performance of the C++ target is much slower than that of the Java target.

@jimidle
Collaborator

jimidle commented Apr 19, 2023 via email

@wangtao9
Contributor Author

@parrt Can this PR be merged?

tool/src/org/antlr/v4/tool/Grammar.java (review comment, resolved)
tool/src/org/antlr/v4/codegen/model/Recognizer.java (review comment, resolved)
@parrt
Member

parrt commented Apr 24, 2023

Perhaps @hzeller has an opinion here, but frankly I'm terrified of a multi-threaded version of the parsing strategy... It is incredibly tricky to get right, and many people rely on the C++ runtime.

@parrt
Member

parrt commented Apr 24, 2023

OK, I just looked at the code. You are simply removing a lock, am I correct? I guess the question is how it works without the lock in a multithreaded environment?

@hzeller
Contributor

hzeller commented Apr 24, 2023

I had a brief look: it replaces one global variable with multiple thread_local ones. I don't know the exact call sequence of the C++ runtime, so I don't know what happens then. It seems like initialize() is now called unconditionally every time instead of once, which would be strange, but I first have to look at exactly what that means in the generated code. Maybe this evening.

Personally, I would anyway avoid doing something like

static Foo *foo = nullptr;
static std::once_flag fooOnce;
std::call_once(fooOnce, initialize_foo);

But more something like

  static Foo *foo = new Foo();

Then the memory model takes care of initializing that static field exactly once, possibly more efficiently than any call_once implementation, which was more of a workaround needed before the C++11 clarification of the memory model.

So I would create functions that create the static object and return the pointer, and then call them in this pattern:

  static StaticData *LexerStaticData = CreateLexerStaticData();  // or whatever that template expands to.

I'd do that unconditionally: don't add a lockFreeCppTarget define, always do it this way. We can then also remove the code that provides call_once(), as it is only needed there.

I know @jcking was looking at multi-threaded performance; maybe he has come across this part of the code and has some recollection of if/why call_once() was needed? I suspect it was some pre-C++11 requirement.
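
The static-initialization pattern suggested above can be sketched as follows (a minimal self-contained illustration; the struct and function names are hypothetical, not the actual generated code):

```cpp
#include <atomic>
#include <cassert>

struct StaticData {
  int atnStateCount;
};

// Counts how many times the factory actually runs.
static std::atomic<int> constructions{0};

static StaticData *CreateLexerStaticData() {
  constructions.fetch_add(1);
  return new StaticData{42};
}

// Since C++11, initialization of a function-local static is guaranteed
// to run exactly once, even with concurrent callers, so no call_once
// machinery is needed.
StaticData *LexerStaticData() {
  static StaticData *data = CreateLexerStaticData();
  return data;
}
```

Every caller gets the same pointer, and the factory runs exactly once, with the compiler emitting the necessary synchronization.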

@hzeller
Contributor

hzeller commented Apr 24, 2023

@jcking changed the once implementation to be either a local one or an absl one in this change, but that did not change the need for call_once() per se. I suspect he did that because things are faster with absl.

With the suggestions in my previous comment we can eliminate the pre-C++11 need for call_once() entirely and just do static initialization.

@wangtao9
Contributor Author

OK, I just looked at the code. You are simply removing a lock, am I correct? I guess the question is how it works without the lock in a multithreaded environment?

@parrt Not exactly. It does not simply remove a lock; it changes the static data shared across threads into one copy per thread. Since the data is owned by each thread, locks are no longer relied upon to keep the data safe (although the locks still exist).

@wangtao9
Contributor Author

@hzeller
"eliminate the pre-c++11 need to call_once()" is a good thing, but it's not directly related to what this optimization does.

The idea of this optimization is to turn the static data shared by multi-threads into thread-owned, so as to avoid competition for locks.

<lexer.name>::initialize()/<parser.name>::initialize() is actually called only once in the constructor (once per thread), so this optimization does not require call_once() or the equivalent C++11 semantics.
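
The once-per-thread behaviour described above can be sketched like this (types and names hypothetical, not the actual runtime code):

```cpp
#include <cassert>
#include <thread>

struct StaticData {
  int dfaStates = 0;
};

// One cache pointer per thread: no lock is needed, because no other
// thread can ever observe this pointer.
static thread_local StaticData *lexerStaticData = nullptr;

// Called from the recognizer constructor. The null check makes it
// idempotent, so the cache is built at most once per thread.
StaticData *initialize() {
  if (lexerStaticData == nullptr) {
    lexerStaticData = new StaticData{};
  }
  return lexerStaticData;
}
```

Repeated calls on one thread return the same pointer, while a second thread builds and sees its own copy.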

@hzeller
Contributor

hzeller commented Apr 25, 2023

Is the data structure modified in each thread ?

If not, we don't need locks, and can make the data structure const to make sure nobody does that in the future.

But if so, then it being static sounds like a bad idea. Thread-local storage will fix that particular situation so locks are not required, but it also means something else is going on, and changing it to thread_local will change the semantics: now every thread sees different content.

@wangtao9
Contributor Author

@hzeller
The static data (mainly the DFA, ATN, etc.) is constructed during the parse process and is not modified after construction completes. Most importantly, the data structures constructed by multiple threads and by a single thread are completely consistent.

I also verified this experimentally. As shown in the figure below, the left side is the log of building DFA states with a single thread, and the right side is 4 threads doing the same thing. Except for the different thread ids, the constructed DFA states are exactly the same.
[image: DFA construction logs, single-threaded vs. 4 threads]

@ericvergnaud
Contributor

So basically you're sacrificing reuse to avoid locks? I suspect this might be counterproductive in terms of performance with complex grammars, because each thread needs to rebuild the complete DFA instead of it being built just once. Is it possible that the locks that currently protect concurrent updates of the DFA are protecting too much code?

Signed-off-by: wangtao9 <wangtaofighting@163.com>
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@wangtao9
Contributor Author

@ericvergnaud I think what you say makes sense; it could happen.
However, in my usage scenario (ultra-high concurrent requests), I care more about the scalability of multi-core performance, and this optimization is necessary there. In the scenarios I tested, it is effective.
To be on the safe side, I added an option -lock-free-cpp-target and turned this optimization off by default.

@hzeller
Contributor

hzeller commented Apr 25, 2023

So if the state is constructed once and then never modified, the object can be const, right? If so, there is no need for any locks (and thus no need for thread_local).
The fact that this apparently makes a difference suggests that the state is indeed modified in the various threads, so locks are needed?

@ericvergnaud
Contributor

ericvergnaud commented Apr 25, 2023 via email

Member

@KvanTTT KvanTTT left a comment


I thought about it once again, and now I don't think introducing the new option is a good idea. Why can't this option be activated when compiling the runtime instead? That is a more flexible solution since it doesn't require regeneration. At least in C++ it's possible to use preprocessor directives:

#if USE_THREAD_LOCAL_CACHE
static thread_local
#endif
<lexer.name; format = "cap">StaticData *<lexer.grammarName; format = "lower">LexerStaticData = nullptr;
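
For a grammar named Cypher, that template would expand to roughly the following (a hypothetical expansion for illustration, not the actual generated file):

```cpp
#include <cassert>

// Hypothetical static-data type for a lexer generated from Cypher.g4.
struct CypherLexerStaticData {
  int serializedATN = 0;
};

// Default the macro off so the fragment compiles either way.
#ifndef ANTLR4_USE_THREAD_LOCAL_CACHE
#define ANTLR4_USE_THREAD_LOCAL_CACHE 0
#endif

// The preprocessor selects per-thread or shared storage at compile time,
// with no regeneration of the parser required.
#if ANTLR4_USE_THREAD_LOCAL_CACHE
static thread_local
#else
static
#endif
CypherLexerStaticData *cypherlexerLexerStaticData = nullptr;
```

Flipping the macro at compile time changes only the storage duration of the cache pointer; the rest of the generated code is unchanged.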

@jimidle
Collaborator

jimidle commented Apr 26, 2023 via email

@wangtao9
Contributor Author

wangtao9 commented Apr 26, 2023

@hzeller In C++, these data cannot simply be declared const, because we have no way to determine their values at declaration time; they are assigned during the parse process.

But I think there is a way (with major changes) to completely remove the use of locks. That would be a big upgrade, and similar optimization effect could be achieved.

@wangtao9
Contributor Author

wangtao9 commented Apr 26, 2023

@KvanTTT I also support this modification: changing the generation-time option to a compile-time one.
I will make it happen.

Updated:
I have made this modification and named the option ANTLR4_USE_THREAD_LOCAL_CACHE to avoid C++ macro conflicts.

@wangtao9
Contributor Author

@jimidle What is the multithreading speedup you measured in Go?

I suppose the scalability should be much better than the C++ target's (where the speedup of 32 threads over a single thread is only 1.34).

@jimidle
Collaborator

jimidle commented Apr 26, 2023 via email

Signed-off-by: wangtao9 <wangtaofighting@163.com>
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@mike-lischke
Member

Not sharing the DFA means each thread has to do the warmup phase on its own. What exactly does this approach improve? Keep in mind that the locks are only used while building up the DFA. It doesn't affect normal runtime behavior once this is done (well, actually, if new input comes up that wasn't parsed before there can still be modifications, but this has a very low overhead, compared to the initial warmup). Additionally, making the DFA thread local can have a serious memory impact.

To me the attempts to improve that part of the code seem like effort spent in the wrong spot. It's not the (very few) locks there, which are only used while building up the DFA, and whose impact could be lowered by running a parser in a single thread and priming it with typical input until the DFA no longer grows significantly, only then starting the other threads. Instead, work should be spent on the adaptivePredict part, which has a much higher overall impact: first convert the recursion to iteration to avoid deep call stacks (which can become a serious problem in threads, where you cannot set a bigger stack size), and then remove the need for std::shared_ptr. Manage the pointers in a different way and work only with raw pointers in this hot path. Maybe a shared pointer class of our own would be useful here, one that does not have to be thread safe and hence doesn't use locks?
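
The last suggestion, a shared pointer that skips atomic reference counting for single-threaded hot paths, could look roughly like this (a hypothetical sketch, not a proposal for the actual runtime):

```cpp
#include <cassert>
#include <cstddef>

// A minimal non-thread-safe shared pointer: the refcount is a plain
// size_t, so copies avoid the atomic operations std::shared_ptr pays
// for. Safe only when all copies live on one thread.
template <typename T>
class LocalSharedPtr {
 public:
  explicit LocalSharedPtr(T *p) : ptr_(p), count_(new std::size_t(1)) {}
  LocalSharedPtr(const LocalSharedPtr &o) : ptr_(o.ptr_), count_(o.count_) {
    ++*count_;
  }
  LocalSharedPtr &operator=(const LocalSharedPtr &o) {
    if (this != &o) {
      release();
      ptr_ = o.ptr_;
      count_ = o.count_;
      ++*count_;
    }
    return *this;
  }
  ~LocalSharedPtr() { release(); }

  T *get() const { return ptr_; }
  std::size_t use_count() const { return *count_; }

 private:
  void release() {
    if (--*count_ == 0) {
      delete ptr_;
      delete count_;
    }
  }
  T *ptr_;
  std::size_t *count_;
};
```

The tradeoff is exactly the one named above: plain increments instead of atomic ones, at the cost of giving up cross-thread sharing.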

@mike-lischke
Member

One thing that would be a potential problem on C++ but not on Java/C# is the latter allows us to compute a value multiple times and then quietly discard all but one of them. The C++ target would be responsible for determining which one among multiple values actually ended up in the cache, and releasing the memory from all the remaining ones.

@sharwell, do you refer to the fact that sometimes references are replaced (e.g. during optimization/merge runs in PredictionContext)? I think this is handled properly by the shared pointers (and is one of the main reasons why shared pointers are used at all).

@parrt
Member

parrt commented May 1, 2023

@wangtao9 Are you measuring DFA warm-up time here, or throughput after the parser is warmed up? Or both together? I would bet that after warm-up the existing system gets higher throughput with multiple threads.

@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

@mike-lischke

Keep in mind that the locks are only used while building up the DFA. It doesn't affect normal runtime behavior once this is done

Most of what you say is correct, but I think this key point is not quite right. The locks are not "only used while building up the DFA": after the build is done, the read lock is still taken for EVERY read, resulting in poor concurrent performance of the C++ target (the speedup ratio with 32 threads is less than 2).
That's why I submitted this PR.

@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

@parrt
I measured one warmup plus many parses. Although I did not measure it separately, warmup should be time-consuming. But warmup is a one-time job (whether all threads share one warmup, before this optimization, or each thread does its own, after this optimization); as @mike-lischke mentioned, the subsequent incremental warmup "has a very low overhead, compared to the initial warmup".

@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

So if there is no correctness issue, then this LGTM.

The choice comes with more memory consumption and redundant computation, but that is a tradeoff the user can make. And for the single-threaded case, there is no disadvantage.

@hzeller Totally agree, exactly what I meant.

Signed-off-by: wangtao9 <wangtaofighting@163.com>
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

Also, please add information about the new directive to the doc (cpp-target.md).

@KvanTTT Done.

@mike-lischke
Member

mike-lischke commented May 4, 2023

@wangtao9 OK, thanks for taking the care to create such a patch after analysing the code thoroughly. I also believe the C++ runtime could benefit very much from removing locks (including shared_ptr). So I'm fine with your PR.

Member

@mike-lischke mike-lischke left a comment


Just fix that little typo.

doc/cpp-target.md (review comment, resolved)
Signed-off-by: wangtao9 <wangtaofighting@163.com>
@wangtao9
Contributor Author

wangtao9 commented May 4, 2023

@mike-lischke Thanks for your review. Like you, I also think removing locks is worth doing; maybe it can be put on the agenda in the near future? Until that big change is done, C++ runtime users can get similar benefits from this optimization :D

Member

@KvanTTT KvanTTT left a comment


Now it's OK for me, but please also fix the minor issues in cpp-target.md.

doc/cpp-target.md (review comment, resolved)
Co-authored-by: Ivan Kochurkin <kvanttt@gmail.com>
Signed-off-by: Tao Wang <wangtaofighting@163.com>
Member

@KvanTTT KvanTTT left a comment


Thanks a lot!

@wangtao9
Contributor Author

wangtao9 commented May 6, 2023

Glad to contribute to antlr4 project! :D

@parrt parrt added this to the 4.12.1 milestone May 6, 2023
@parrt parrt merged commit aed321c into antlr:dev May 6, 2023
45 checks passed
@parrt
Member

parrt commented May 6, 2023

Thanks, everyone, especially @wangtao9 !

@taodongl

@wangtao9 @parrt

static thread_local JSONLexerStaticData *jsonlexerLexerStaticData = nullptr;

Does it lead to a memory leak? Who is responsible for releasing the memory when a thread is destroyed?
A similar post about thread_local: https://stackoverflow.com/questions/46429861/c-how-to-use-thread-local-to-declare-a-pointer-variable

@wangtao9
Contributor Author

@taodongl
Great that you found it! There is indeed a risk of memory leaks when threads exit prematurely. I actually fixed this a few weeks ago; see commit wangtao9@ce6649a. I'll submit a new PR later, @parrt.

I solved the problem with a thread_local std::unique_ptr; the unique_ptr is responsible for managing the memory.

No memory leaks are detected now; you can verify it with this link: https://github.com/wangtao9/antlr4-perfopt-test/tree/sanitizer_check
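
The fix described above can be sketched as follows (names hypothetical; see the linked commit for the real change):

```cpp
#include <cassert>
#include <memory>

struct StaticData {
  int atnStates = 7;
};

// A thread_local unique_ptr is destroyed by the C++ runtime when its
// owning thread exits, so early thread termination no longer leaks
// the per-thread cache.
static thread_local std::unique_ptr<StaticData> lexerStaticData;

StaticData *initialize() {
  if (!lexerStaticData) {
    lexerStaticData = std::make_unique<StaticData>();
  }
  return lexerStaticData.get();
}
```

The raw pointer returned to callers stays valid for the thread's lifetime, while ownership (and cleanup) rests with the unique_ptr.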
