Gain more vectorization opportunities #9570
Conversation
Nice :-) I think, though, that this could be much cleaner when numba/llvmlite#1046 is available.

That way, no destructive passes would be executed before linking, we would only ever link each module into the final module once, and we would still share some optimization passes if a function is used several times. Alternatively, and closer to the current behavior, the required libraries could be linked in before the Simplification pass. That would mean the Simplification can do more work, but it also means that sometimes a module might be linked in and optimized multiple times.
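The "link each module into the final module once" idea above can be sketched with a toy model. This is a hypothetical illustration with invented names (`Module`, `link_once`), not Numba's or llvmlite's actual code: dependencies are linked depth-first, each module is optimized at most once, and each is linked into the final module exactly once even when several functions depend on it.

```python
# Toy sketch (invented names, not Numba code) of linking every
# dependency module exactly once and sharing optimization work.

class Module:
    def __init__(self, name, deps=()):
        self.name = name
        self.deps = list(deps)
        self.optimized = False

def link_once(module, linked, opt_log, link_log):
    """Depth-first link: dependencies first, each module only once."""
    if module.name in linked:
        return
    for dep in module.deps:
        link_once(dep, linked, opt_log, link_log)
    if not module.optimized:      # optimization work is shared
        module.optimized = True
        opt_log.append(module.name)
    linked.add(module.name)       # linked into the final module once
    link_log.append(module.name)

# Diamond-shaped dependency graph: "a" and "b" both need "util".
util = Module("util")
a = Module("a", [util])
b = Module("b", [util])
main = Module("main", [a, b])

linked, opt_log, link_log = set(), [], []
link_once(main, linked, opt_log, link_log)
print(link_log)  # each module appears exactly once
```

Under this scheme `util` is optimized and linked a single time even though both `a` and `b` depend on it, which is the compile-time saving the comment describes.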
Thanks for your input @aseyboldt! I didn't notice that #9566 already includes most of the code changes in this PR; perhaps I should close this one in favour of that. I will try to understand #9566 and give my feedback in that thread ASAP!
Hi, thanks for asking. My concerns are:
But I can reopen it for visibility.
I believe this PR is part of an experiment toward this meta-issue: #8430.
Looks like all CI is failing since the new llvmlite release.
I refactored the original commit from @aseyboldt down to only these 9 lines of changes. In my internal benchmark, …

I should say more about my internal benchmark: it also removes all NRT refcount operations, so it may not be representative. I can guard these 2 optimization tactics with a numba config, e.g., … BTW, the first tactic is dominant, i.e., …

If CI can run without the llvmlite issue, it would be nice; I want to see whether the current change breaks anything else. Gentle tag @guilhermeleobas.
I vaguely remember a discussion around the topic of running the LLVM optimizer more than once in Numba. Running it more than once will lead to better code? Maybe, but we pay the price in compile time. Whether Numba should do this is an open question. I'll try to find the issue where something along these lines was discussed.
Yeah, I believe many discussions happened before.

```python
# 1. change fn.linkage
for fn in self._final_module.functions:
    if fn.linkage == ll.Linkage.linkonce_odr:
        fn.linkage = "internal"
```

The key change is the function linkage, from `linkonce_odr` to `internal`.
Setting the linkage type to internal doesn't sound like it should be correct to me. Currently (not in #9566, by the way) modules are linked greedily, and this means that sometimes the same symbols get linked into a module multiple times (if the linkage graph has diamond patterns).

And about running things several times: in the new pass manager, some pipelines explicitly say whether that is OK or not, so using that would make things much clearer, I think.

The current linking scheme often runs the optimizer multiple times on the same code, and I think that can have a detrimental effect on the performance of the final executable. That was one of the reasons for reducing the number of optimization runs in #9566. It can't get rid of multiple runs entirely, because of the refprune pass, but in the new pass manager I think that pass could be embedded in one or several reasonable extension points of the existing pipeline, so that hopefully we wouldn't need to run more than one pipeline at all. (But simplification passes might still make sense, to reduce duplicate work and speed up compilation if a function is used multiple times.)
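The diamond-pattern duplication mentioned above is easy to see in a toy model. This is an invented illustration (the `deps` graph and `greedy_link` are not Numba code): `main` pulls in `a` and `b`, each of which greedily links its own copy of `util`, so `util` is linked, and potentially re-optimized, twice.

```python
# Toy illustration of greedy, per-dependency linking on a
# diamond-shaped graph (invented names, not Numba's linker).

deps = {
    "main": ["a", "b"],
    "a": ["util"],
    "b": ["util"],
    "util": [],
}

def greedy_link(name, link_log):
    """Link every dependency eagerly, with no deduplication."""
    for dep in deps[name]:
        greedy_link(dep, link_log)
        link_log.append(dep)

link_log = []
greedy_link("main", link_log)
print(link_log)                 # 'util' appears twice
print(link_log.count("util"))
```

With deduplication (tracking already-linked symbols), the second link of `util` and any repeated optimizer runs over its code would disappear, which is the effect #9566 aims for.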
This reverts commit ee6173a.
Yeah, funny story: if I ran the test for …

We definitely need some benchmarks for further discussion. I did find more vector instructions for my personal test case, e.g., more …

Updates: if I only use the first commit …
I have one more detail to share: I found my personal test case works because I did 2 things together:

If I only apply this PR without the removal, some cases get benefits, while other cases become slower...

BTW, I wonder if Numba wants to have the 2nd feature, e.g., adding a numba config, …
Just for fun: after testing with other linkage types, I found that this change also works:

```python
for fn in self._final_module.functions:
    if fn.linkage == ll.Linkage.linkonce_odr:
        fn.linkage = "external"
```

Although I can't explain why, someone familiar with linking and linkage types could dig into the root cause in the future.
This PR includes the 2nd idea, which comes from Discourse: Compilation pipeline, compile time and vectorization
Just want to let numba CI test this.
In some cases, this PR can bring more vectorization chances and give users a better performance.
In my personal test case, the perf benefit is 10~20%.
Known issues:
- `cache=True`, error msg: `'module already added to this engine'`
- `cache=False`, error msg: `'Symbol _ZN5numba2np8arrayobj15_call_allocator... not linked properly'`
Unknown issues:
- `_mpm_cheap` in `Codegen`: does it have some unknown effect?