
Gain more vectorization opportunities #9570

Open · wants to merge 6 commits into main
Conversation

@dlee992 (Contributor) commented May 10, 2024

This PR implements the second idea from the Discourse thread "Compilation pipeline, compile time and vectorization".

For now, I just want to let the Numba CI test this.

In some cases, this PR exposes more vectorization opportunities and gives users better performance.
In my personal test case, the perf benefit is 10~20%.

Known issues:

  • It doesn't work well with cache=True; error msg: 'module already added to this engine'
  • After setting cache=False, error msg: 'Symbol _ZN5numba2np8arrayobj15_call_allocator... not linked properly'

Unknown issues:

  • It removes _mpm_cheap in Codegen; does this have any unknown effects?
  • It removes some optimization passes added specifically for LLVM 11 and 14. Does that matter?

@aseyboldt (Contributor) commented

Nice :-)
You can also try #9566, which already contains something along these lines.

I think, though, that this could be much cleaner once numba/llvmlite#1046 is available.
I think the following would then make sense:

  • Run the SimplifyModule pipeline on each _final_module in isolation (i.e., before other modules are linked in). I think the refprune pass can be included in that, maybe using this extension point?
  • Link all the required libraries into the _compiled_module.
  • Run the default pipeline (again with the refprune pass) after linking in the required libraries.

That way, no destructive passes are executed before linking, each module is only ever linked into the final module once, and we still share some optimization work when a function is used several times.

Alternatively, and closer to the current behavior, the required libraries could be linked in before the simplification pass. That would let the simplification do more work, but it also means a module might sometimes be linked in and optimized multiple times.
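
To make that concrete, a rough sketch of the proposed three-step ordering. It assumes the new-pass-manager bindings from numba/llvmlite#1046 land roughly in this shape, and the attribute names (_tm, _modules, _required_libraries) are illustrative; only _compiled_module comes from the comment above.

import llvmlite.binding as llvm

def finalize(self):
    # 1. Simplify each module in isolation, before anything is linked in;
    #    the refprune pass would hook in here via an extension point.
    pto = llvm.create_pipeline_tuning_options(speed_level=1)
    pb = llvm.create_pass_builder(self._tm, pto)
    for mod in self._modules:
        pb.getModulePassManager().run(mod, pb)

    # 2. Link every required library into the compiled module exactly once.
    for lib in self._required_libraries:
        self._compiled_module.link_in(lib, preserve=True)

    # 3. Run the full default pipeline (again with refprune) once, after linking.
    pto = llvm.create_pipeline_tuning_options(speed_level=3)
    pb = llvm.create_pass_builder(self._tm, pto)
    pb.getModulePassManager().run(self._compiled_module, pb)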

@dlee992 (Contributor, Author) commented May 13, 2024

Thanks for your input @aseyboldt!

I didn't notice that #9566 already includes most of the code changes in this PR; perhaps I should close this one in favour of that.

Will try to understand #9566 and give my feedback in that thread ASAP!

@twiecki commented May 14, 2024

@dlee992 But #9566 was not merged, so I don't think this one should be closed.

@dlee992 (Contributor, Author) commented May 14, 2024

Hi, thanks for asking.

My concerns are:

  1. "Prevent compilation when no_cpython_wrapper is set and restructure linking IR modules" (#9566) contains substantial changes to codegen.py, as this PR does. That PR fundamentally adds a new compilation status and changes some optimization/linking orders; I feel it's hard to push these two draft PRs forward at the same time.
  2. More importantly, I don't have a clue how to fix the known and unknown issues listed above; and, per aseyboldt's comment, this PR also needs changes on the llvmlite side (e.g., the new LLVM pass manager) that can't be included in this PR.
  3. As for reviewers from the core devs, I believe they also don't want to review two PRs changing codegen.py at the same time, given their rather full schedule of migrating NumPy, CPython, CUDA, etc.

But I can reopen it for visibility.

dlee992 reopened this May 14, 2024
dlee992 changed the title from "[investigate] Missed vectorization opportunities" to "[investigate] Missed vectorization opportunities by skipping _mpm_cheap" May 14, 2024
dlee992 changed the title from "[investigate] Missed vectorization opportunities by skipping _mpm_cheap" to "[investigate] Gain vectorization opportunities by skipping _mpm_cheap" May 14, 2024
@dlee992 (Contributor, Author) commented May 14, 2024

I believe this PR is one experiment toward this meta-issue: #8430.

dlee992 changed the title from "[investigate] Gain vectorization opportunities by skipping _mpm_cheap" to "[investigate] Gain vectorization opportunities" May 15, 2024
dlee992 marked this pull request as ready for review May 15, 2024 16:37
dlee992 changed the title from "[investigate] Gain vectorization opportunities" to "Gain more vectorization opportunities" May 15, 2024
@dlee992 (Contributor, Author) commented May 15, 2024

Looks like all CI jobs are failing since the new llvmlite release:

ERROR: No matching distribution found for llvmlite<0.45,>=0.44.0dev0

@dlee992 (Contributor, Author) commented May 15, 2024

I refactored the original commit from @aseyboldt down to just these 9 lines of changes. In my internal benchmark:

  • I still get the perf benefit, ~10% - 20%,
  • the side effect is increased compilation time (e.g., 2x),
  • it may also use more memory (untested).

I should say more about my internal benchmark: it also removes all NRT refcount operations, so it may not be representative.

I could guard these two optimization tactics behind a numba config option, e.g., NUMBA_EXTRA_OPTIMIZATION, which would warn users that they may recover some runtime performance at the cost of longer compilation time (and possibly more memory usage); see the sketch below.
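
A minimal sketch of what that guard could look like. NUMBA_EXTRA_OPTIMIZATION is the name proposed in this comment; reading it straight from the environment and placing the guard inside _finalize_final_module are assumptions for illustration.

import os
import llvmlite.binding as ll

# Hypothetical opt-in flag; real Numba config lives in numba.core.config.
NUMBA_EXTRA_OPTIMIZATION = bool(int(os.environ.get("NUMBA_EXTRA_OPTIMIZATION", "0")))

def _finalize_final_module(self):
    if NUMBA_EXTRA_OPTIMIZATION:
        # Tactic 1 (dominant): demote linkonce_odr functions to internal
        # linkage so the optimizer can treat them as private to this module.
        for fn in self._final_module.functions:
            if fn.linkage == ll.Linkage.linkonce_odr:
                fn.linkage = "internal"
        # Tactic 2 (safety net): rerun the function-level optimizations on
        # the final module; this is what roughly doubles compilation time.
        self._optimize_functions(self._final_module)
    ...  # existing finalization continues here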

BTW, the first tactic, changing fn.linkage, is the dominant one.

It would be nice if CI could run without the llvmlite issue; I want to see whether the current change breaks anything else. Gentle tag @guilhermeleobas.

@guilhermeleobas (Collaborator) commented

I vaguely remember a discussion around the topic of running the LLVM optimizer more than once in Numba. Will running it more than once lead to better code? Maybe, but we pay the price in compile time. Whether Numba should do this is an open question. I'll try to find the issue where something along these lines was discussed.

@dlee992 (Contributor, Author) commented May 15, 2024

Yeah, I believe many discussions have happened before.
But a big difference from previous optimization discussions is that the dominant change in this PR is:

# 1. change fn.linkage
for fn in self._final_module.functions:
    if fn.linkage == ll.Linkage.linkonce_odr:
        fn.linkage = "internal"

i.e., changing the function linkage from linkonce_odr to internal in the final module. I don't know exactly what this means on the LLVM side, but it's the magic here.
I need to compare the LLVM IR emitted before and after.

@aseyboldt (Contributor) commented

Setting the linkage type to internal doesn't sound like it should be correct to me. Currently (not in #9566, by the way) modules are linked greedily, which means the same symbols sometimes get linked into a module multiple times (if the linkage graph has diamond patterns). linkonce_odr promises to LLVM that those definitions will be equivalent (see here).
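
For illustration, a small self-contained llvmlite example (not Numba's actual codegen) of why duplicate linkonce_odr definitions are tolerated at link time:

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Two modules that each define the same linkonce_odr symbol, as can happen
# with diamond patterns in the linkage graph.
HELPER = """
define linkonce_odr i32 @helper(i32 %x) {
  %r = add i32 %x, 1
  ret i32 %r
}
"""

dst = llvm.parse_assembly(HELPER)
src = llvm.parse_assembly(HELPER)
dst.link_in(src)  # succeeds: LLVM merges the duplicates, keeping one copy
print(str(dst).count("define"))  # 1
# With `internal` linkage instead, both copies would be kept as distinct
# (renamed) private symbols; with `external`, linking would fail on the
# duplicate definition.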

And about running things several times: in the new pass manager, some pipelines explicitly say whether that is OK, so using it would make things much clearer, I think. The current linking scheme often runs the optimizer multiple times on the same code, and I think that can have a detrimental effect on the performance of the final executable. That was one of the reasons for reducing the number of optimization runs in #9566. It can't get rid of multiple runs entirely, because of the refprune pass, but in the new pass manager I think that pass could be embedded in one or several reasonable extension points of the existing pipeline, so that hopefully we wouldn't need to run more than one pipeline at all. (Simplification passes might still make sense to reduce duplicate work and speed up compilation if a function is used multiple times.)

@dlee992 (Contributor, Author) commented May 15, 2024

Yeah, funny story: if I run the test for outer_logp_wrapper from the original Discourse thread with the latest commit in this PR, I can't get a perf benefit. So I guess the latest commit only helps my personal test case.

We definitely need some benchmarks for further discussion. I did find more vector instructions in my personal test case, e.g., more <size x type> values in its LLVM IR. Maybe I should build a benchmark from that case.

Update: using only the first commit a2d02e44d, I still can't get a perf benefit for outer_logp_wrapper; not sure if I did something wrong.
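
As a side note on measuring this: one quick way to compare vectorization before and after is to count SIMD value types in the IR that Numba reports for a compiled function. A generic sketch (not the outer_logp_wrapper case):

import re
import numpy as np
from numba import njit

@njit
def axpy(a, x, y):
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = a * x[i] + y[i]
    return out

x = np.ones(1024)
axpy(2.0, x, x)  # trigger compilation

sig = axpy.signatures[0]
ir = axpy.inspect_llvm(sig)
# Count SIMD value types such as <4 x double>; how many appear (if any)
# depends on the optimization pipeline being tested.
print(len(re.findall(r"<\d+ x [a-z0-9]+>", ir)))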

@dlee992 (Contributor, Author) commented May 16, 2024

I have one more detail to share: my personal test case only improves because I did two things together:

  1. apply this PR (changing the linkage alone is enough for me; rerunning _optimize_functions is more of a safety net)
  2. remove all NRT_incref and NRT_decref call instructions from the LLVM IR generated by Numba (yes, dangerous!)

If I only apply this PR without the removal, some cases benefit while others become slower...
Using only the 2nd change gains me ~30% runtime benefit; adding this PR on top gains an extra ~5-10% "stably".

BTW, I wonder if Numba wants the 2nd feature, e.g., a numba config option NUMBA_LLVM_REF_DELETION that, when enabled, removes all NRT_incref and NRT_decref call instructions from the LLVM IR. I know it's dangerous in general, but it can be very useful at runtime when you know exactly what you are doing; a rough sketch follows. cc @sklam
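
A rough sketch of what such a flag could do, stripping the calls at the textual IR level. This is an assumption about the mechanism, not existing Numba code, and it is unsafe whenever object lifetimes actually depend on the refcounts:

import re
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Toy stand-in for Numba-generated IR containing NRT refcount calls.
IR = """
declare void @NRT_incref(i8*)
declare void @NRT_decref(i8*)

define void @f(i8* %obj) {
  call void @NRT_incref(i8* %obj)
  call void @NRT_decref(i8* %obj)
  ret void
}
"""

def strip_nrt_refcounts(llvm_ir):
    # NRT_incref/NRT_decref return void, so whole call lines can be dropped
    # without leaving dangling SSA uses. Real Numba IR may use `tail call`
    # or attach attributes, which would need a broader pattern.
    return re.sub(r"^\s*call void @NRT_(incref|decref)\(.*\)$\n?",
                  "", llvm_ir, flags=re.MULTILINE)

mod = llvm.parse_assembly(strip_nrt_refcounts(IR))  # re-verify the edited IR
print("call void @NRT" in str(mod))  # False: the calls are gone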

@dlee992 (Contributor, Author) commented May 21, 2024

Just for fun:

After testing other linkage types, I found that changing linkonce_odr to external also brings perf benefits. Depending on the workload, external can be better or worse than internal, but external is "always" better than linkonce_odr and has basically no effect on compilation time.

for fn in self._final_module.functions:
    if fn.linkage == ll.Linkage.linkonce_odr:
        fn.linkage = "external"

I can't explain this, but someone familiar with linking and linkage types could dig into the root cause in the future.
