
Gain more vectorization opportunities #9570

Open · wants to merge 6 commits into main
Conversation

@dlee992 (Contributor) commented May 10, 2024

This PR implements the second idea from the Discourse thread "Compilation pipeline, compile time and vectorization".

For now, I just want to let the Numba CI test this.

In some cases, this PR exposes more vectorization opportunities and gives users better performance.
In my personal test case, the perf benefit is 10~20%.

Known issues:

  • It doesn't work well with cache=True; error msg: 'module already added to this engine'
  • After setting cache=False, error msg: 'Symbol _ZN5numba2np8arrayobj15_call_allocator... not linked properly'

Unknown issues:

  • It removes _mpm_cheap in Codegen; does this have any unknown effects?
  • It removes some optimization passes added specifically for LLVM 11 and 14. Does that matter?

@aseyboldt (Contributor) commented

Nice :-)
You can also try #9566, which already contains something along these lines.

I think, though, that this could be much cleaner once numba/llvmlite#1046 is available.
I think the following would then make sense:

  • Run the SimplifyModule pipeline on each _final_module in isolation (i.e., before other modules are linked in). I think the refprune pass can be included in that, maybe using this extension point?
  • Link all the required libraries into the _compiled_module.
  • Run the default pipeline (again with the refprune pass) after linking in the required libraries.

That way, no destructive passes are executed before linking, each module is only ever linked into the final module once, and we still share some optimization work when a function is used several times.

Alternatively, and closer to the current behavior, the required libraries could be linked in before the simplification pass. That would let the simplification do more work, but it also means a module might sometimes be linked in and optimized multiple times.
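
To make that concrete, a rough sketch of the proposed three-step ordering. It assumes the new-pass-manager bindings from numba/llvmlite#1046 land roughly in this shape, and the attribute names (_tm, _modules, _required_libraries) are illustrative; only _compiled_module comes from the comment above.

import llvmlite.binding as llvm

def finalize(self):
    # 1. Simplify each module in isolation, before anything is linked in;
    #    the refprune pass would hook in here via an extension point.
    pto = llvm.create_pipeline_tuning_options(speed_level=1)
    pb = llvm.create_pass_builder(self._tm, pto)
    for mod in self._modules:
        pb.getModulePassManager().run(mod, pb)

    # 2. Link every required library into the compiled module exactly once.
    for lib in self._required_libraries:
        self._compiled_module.link_in(lib, preserve=True)

    # 3. Run the full default pipeline (again with refprune) once, after linking.
    pto = llvm.create_pipeline_tuning_options(speed_level=3)
    pb = llvm.create_pass_builder(self._tm, pto)
    pb.getModulePassManager().run(self._compiled_module, pb)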

@dlee992 (Contributor, Author) commented May 13, 2024

Thanks for your input @aseyboldt!

I didn't notice that #9566 already includes most of the code changes in this PR; perhaps I should close this one in favour of that.

Will try to understand #9566 and give my feedback in that thread ASAP!

@twiecki commented May 14, 2024

@dlee992 But #9566 was not merged, so I don't think this one should be closed.

@dlee992 (Contributor, Author) commented May 14, 2024

Hi, thanks for asking.

My concerns are:

  1. "Prevent compilation when no_cpython_wrapper is set and restructure linking IR modules" (#9566) contains substantial changes to codegen.py, as this PR does. That PR fundamentally adds a new compilation status and changes some optimization/linking orders; I feel it's hard to push these two draft PRs forward at the same time.
  2. More importantly, I don't have a clue how to fix the known and unknown issues listed above; and, per aseyboldt's comment, this PR also needs changes on the llvmlite side (e.g., the new LLVM pass manager) that can't be included in this PR.
  3. As for reviewers from the core devs, I believe they also don't want to review two PRs changing codegen.py at the same time, given their rather full schedule of migrating NumPy, CPython, CUDA, etc.

But I can reopen it for visibility.

dlee992 reopened this May 14, 2024
dlee992 changed the title from "[investigate] Missed vectorization opportunities" to "[investigate] Missed vectorization opportunities by skipping _mpm_cheap" May 14, 2024
dlee992 changed the title from "[investigate] Missed vectorization opportunities by skipping _mpm_cheap" to "[investigate] Gain vectorization opportunities by skipping _mpm_cheap" May 14, 2024
@dlee992 (Contributor, Author) commented May 14, 2024

I believe this PR is one experiment toward this meta-issue: #8430.

dlee992 changed the title from "[investigate] Gain vectorization opportunities by skipping _mpm_cheap" to "[investigate] Gain vectorization opportunities" May 15, 2024
dlee992 marked this pull request as ready for review May 15, 2024 16:37
dlee992 changed the title from "[investigate] Gain vectorization opportunities" to "Gain more vectorization opportunities" May 15, 2024
@dlee992 (Contributor, Author) commented May 15, 2024

Looks like all CI jobs are failing since the new llvmlite release:

ERROR: No matching distribution found for llvmlite<0.45,>=0.44.0dev0

@dlee992 (Contributor, Author) commented May 15, 2024

I refactored the original commit from @aseyboldt down to just these 9 lines of changes. In my internal benchmark:

  • I still get the perf benefit, ~10% - 20%,
  • the side effect is increased compilation time (e.g., 2x),
  • it may also use more memory (untested).

I should say more about my internal benchmark: it also removes all NRT refcount operations, so it may not be representative.

I could guard these two optimization tactics behind a numba config option, e.g., NUMBA_EXTRA_OPTIMIZATION, which would warn users that they may recover some runtime performance at the cost of longer compilation time (and possibly more memory usage); see the sketch below.
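
A minimal sketch of what that guard could look like. NUMBA_EXTRA_OPTIMIZATION is the name proposed in this comment; reading it straight from the environment and placing the guard inside _finalize_final_module are assumptions for illustration.

import os
import llvmlite.binding as ll

# Hypothetical opt-in flag; real Numba config lives in numba.core.config.
NUMBA_EXTRA_OPTIMIZATION = bool(int(os.environ.get("NUMBA_EXTRA_OPTIMIZATION", "0")))

def _finalize_final_module(self):
    if NUMBA_EXTRA_OPTIMIZATION:
        # Tactic 1 (dominant): demote linkonce_odr functions to internal
        # linkage so the optimizer can treat them as private to this module.
        for fn in self._final_module.functions:
            if fn.linkage == ll.Linkage.linkonce_odr:
                fn.linkage = "internal"
        # Tactic 2 (safety net): rerun the function-level optimizations on
        # the final module; this is what roughly doubles compilation time.
        self._optimize_functions(self._final_module)
    ...  # existing finalization continues here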

BTW, the first tactic, changing fn.linkage, is the dominant one.

It would be nice if CI could run without the llvmlite issue; I want to see whether the current change breaks anything else. Gentle tag @guilhermeleobas.

@guilhermeleobas (Collaborator) commented

I vaguely remember a discussion around the topic of running the LLVM optimizer more than once in Numba. Will running it more than once lead to better code? Maybe, but we pay the price in compile time. Whether Numba should do this is an open question. I'll try to find the issue where something along these lines was discussed.

@dlee992 (Contributor, Author) commented May 15, 2024

Yeah, I believe many discussions have happened before.
But a big difference from previous optimization discussions is that the dominant change in this PR is:

# 1. change fn.linkage
for fn in self._final_module.functions:
    if fn.linkage == ll.Linkage.linkonce_odr:
        fn.linkage = "internal"

i.e., changing the function linkage from linkonce_odr to internal in the final module. I don't know exactly what this means on the LLVM side, but it's the magic here.
I need to compare the LLVM IR emitted before and after.

@aseyboldt (Contributor) commented

Setting the linkage type to internal doesn't sound like it should be correct to me. Currently (not in #9566, by the way) modules are linked greedily, which means the same symbols sometimes get linked into a module multiple times (if the linkage graph has diamond patterns). linkonce_odr promises to LLVM that those definitions will be equivalent (see here).
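
For illustration, a small self-contained llvmlite example (not Numba's actual codegen) of why duplicate linkonce_odr definitions are tolerated at link time:

import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Two modules that each define the same linkonce_odr symbol, as can happen
# with diamond patterns in the linkage graph.
HELPER = """
define linkonce_odr i32 @helper(i32 %x) {
  %r = add i32 %x, 1
  ret i32 %r
}
"""

dst = llvm.parse_assembly(HELPER)
src = llvm.parse_assembly(HELPER)
dst.link_in(src)  # succeeds: LLVM merges the duplicates, keeping one copy
print(str(dst).count("define"))  # 1
# With `internal` linkage instead, both copies would be kept as distinct
# (renamed) private symbols; with `external`, linking would fail on the
# duplicate definition.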

And about running things several times: in the new pass manager, some pipelines explicitly say whether that is OK, so using it would make things much clearer, I think. The current linking scheme often runs the optimizer multiple times on the same code, and I think that can have a detrimental effect on the performance of the final executable. That was one of the reasons for reducing the number of optimization runs in #9566. It can't get rid of multiple runs entirely, because of the refprune pass, but in the new pass manager I think that pass could be embedded in one or several reasonable extension points of the existing pipeline, so that hopefully we wouldn't need to run more than one pipeline at all. (Simplification passes might still make sense to reduce duplicate work and speed up compilation if a function is used multiple times.)

@dlee992 (Contributor, Author) commented May 15, 2024

Yeah, funny story: if I run the test for outer_logp_wrapper from the original Discourse thread with the latest commit in this PR, I can't get a perf benefit. So I guess the latest commit only helps my personal test case.

We definitely need some benchmarks for further discussion. I did find more vector instructions in my personal test case, e.g., more <size x type> values in its LLVM IR. Maybe I should build a benchmark from that case.

Update: using only the first commit a2d02e44d, I still can't get a perf benefit for outer_logp_wrapper; not sure if I did something wrong.
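
As a side note on measuring this: one quick way to compare vectorization before and after is to count SIMD value types in the IR that Numba reports for a compiled function. A generic sketch (not the outer_logp_wrapper case):

import re
import numpy as np
from numba import njit

@njit
def axpy(a, x, y):
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = a * x[i] + y[i]
    return out

x = np.ones(1024)
axpy(2.0, x, x)  # trigger compilation

sig = axpy.signatures[0]
ir = axpy.inspect_llvm(sig)
# Count SIMD value types such as <4 x double>; how many appear (if any)
# depends on the optimization pipeline being tested.
print(len(re.findall(r"<\d+ x [a-z0-9]+>", ir)))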

@dlee992 (Contributor, Author) commented May 16, 2024

I have one more detail to share: my personal test case only improves because I did two things together:

  1. apply this PR (changing the linkage alone is enough for me; rerunning _optimize_functions is more of a safety net)
  2. remove all NRT_incref and NRT_decref call instructions from the LLVM IR generated by Numba (yes, dangerous!)

If I only apply this PR without the removal, some cases benefit while others become slower...
Using only the 2nd change gains me ~30% runtime benefit; adding this PR on top gains an extra ~5-10% "stably".

BTW, I wonder if Numba wants the 2nd feature, e.g., a numba config option NUMBA_LLVM_REF_DELETION that, when enabled, removes all NRT_incref and NRT_decref call instructions from the LLVM IR. I know it's dangerous in general, but it can be very useful at runtime when you know exactly what you are doing; a rough sketch follows. cc @sklam
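
A rough sketch of what such a flag could do, stripping the calls at the textual IR level. This is an assumption about the mechanism, not existing Numba code, and it is unsafe whenever object lifetimes actually depend on the refcounts:

import re
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Toy stand-in for Numba-generated IR containing NRT refcount calls.
IR = """
declare void @NRT_incref(i8*)
declare void @NRT_decref(i8*)

define void @f(i8* %obj) {
  call void @NRT_incref(i8* %obj)
  call void @NRT_decref(i8* %obj)
  ret void
}
"""

def strip_nrt_refcounts(llvm_ir):
    # NRT_incref/NRT_decref return void, so whole call lines can be dropped
    # without leaving dangling SSA uses. Real Numba IR may use `tail call`
    # or attach attributes, which would need a broader pattern.
    return re.sub(r"^\s*call void @NRT_(incref|decref)\(.*\)$\n?",
                  "", llvm_ir, flags=re.MULTILINE)

mod = llvm.parse_assembly(strip_nrt_refcounts(IR))  # re-verify the edited IR
print("call void @NRT" in str(mod))  # False: the calls are gone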

@dlee992 (Contributor, Author) commented May 21, 2024

Just for fun:

After testing other linkage types, I found that changing linkonce_odr to external also brings perf benefits. Depending on the workload, external can be better or worse than internal, but external is "always" better than linkonce_odr and has basically no effect on compilation time.

for fn in self._final_module.functions:
    if fn.linkage == ll.Linkage.linkonce_odr:
        fn.linkage = "external"

I can't explain this, but someone familiar with linking and linkage types could dig into the root cause in the future.
