
aarch64 linux: torch.compile performance is 2x slow with nightly torch wheel compared to the wheel built with 'build_aarch64_wheel.py' script #1774

Closed
snadampal opened this issue Apr 5, 2024 · 18 comments · Fixed by #1803 · May be fixed by #1781

Comments

@snadampal
Contributor

For torchbench benchmarks with the dynamo backend, the aarch64 linux nightly wheel is 2x slower than the wheel I've built using the pytorch/builder/build_aarch64_wheel.py script for the same pytorch commit.

The difference seems to come from https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_ci_build.sh, which is used for the nightly builds. I suspect the cause is the libomp it pulls in.

How to reproduce?

git clone https://github.com/pytorch/benchmark.git
cd benchmark

# apply this PR: https://github.com/pytorch/benchmark/pull/2187

# set OMP threads to 16, because I'm using a c7g.4xl instance

OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies,cpu_peak_mem"
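
For comparison against a locally built wheel, the nightly CPU wheel can be installed from the usual PyTorch nightly index (a generic command, not part of the original report; pick the nightly date that matches your source build):

# install the nightly aarch64 CPU wheel to compare against a locally built one
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu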
@snadampal
Contributor Author

It's indeed the conda libomp.so in the nightly wheel that is causing the issue. I replaced it with the Debian libgomp and the performance is restored, a 2x improvement.
I'm checking how I can switch to libgomp for the aarch64 nightly and release wheel builds.
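
A quick way to check which OpenMP runtime a given torch wheel bundles and actually resolves at load time (a rough sketch; the hashed library names differ between wheel builds):

# locate the lib directory of the installed torch package
TORCH_LIB=$(python3 -c "import os, torch; print(os.path.join(os.path.dirname(torch.__file__), 'lib'))")
# list any bundled OpenMP runtimes (libomp = LLVM, libgomp = GNU)
ls "$TORCH_LIB" | grep -i omp
# see which runtime libtorch_cpu.so resolves at load time
ldd "$TORCH_LIB/libtorch_cpu.so" | grep -i omp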

@atalman
Contributor

atalman commented Apr 5, 2024

cc @malfet

@malfet
Contributor

malfet commented Apr 5, 2024

@snadampal so, should we package libomp from debian in our build scripts?

@bryantbiggs
Contributor

I would love to help get away from these conda-packaged deps and instead use something more OS native (i.e., build in a container using what's provided via apt, yum, dnf, etc.).

@snadampal
Contributor Author

The issue is observed for the pytorch 2.3 release candidate wheels as well.
pip3 install torch==2.3.0 --index-url https://download.pytorch.org/whl/test/cpu

In the release wheel package, I see we are already packaging both OpenMP libraries:

libgomp-0f9e2209.so.1.0.0 --> from debian
libomp-b8e5bcfb.so --> from conda

But it looks like all the libraries are linked against libomp-b8e5bcfb.so, so it might be more than a packaging change; we will have to switch the dependent libraries too.
For experimenting, I renamed libgomp-0f9e2209.so.1.0.0 to libomp-b8e5bcfb.so and saw the performance restored.
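
For reference, the experiment amounts to something like the following, run inside the installed wheel's torch/lib directory (the hashed names are from this particular wheel and will differ between builds; keep a backup of the original):

# keep the original LLVM runtime around
cp libomp-b8e5bcfb.so libomp-b8e5bcfb.so.orig
# dependent libraries still request the libomp file name, but now load the GNU libgomp code
cp libgomp-0f9e2209.so.1.0.0 libomp-b8e5bcfb.so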

@malfet
Contributor

malfet commented Apr 5, 2024

@bryantbiggs I think this is the general direction the build process is going towards: no Anaconda, just use what's in the pypa docker image.

@snadampal
Contributor Author

I will check if the new scripts can be updated to remove the conda dependency.

Otherwise we have a fallback option anyway: the old scripts https://github.com/pytorch/builder/blob/main/aarch64_linux/aarch64_wheel_ci_build.py are native manylinux OS builds. They are being maintained, so we can switch to them for the CD.

@snadampal
Contributor Author

One small correction to my previous statement: we are packaging only the conda libomp.so for the nightly and release wheels, not both the conda and manylinux ones. This corrects my earlier statement that "in the release wheel package, I see we are already packaging both the omp libraries."

Coming to the solution, it's the same: if we replace libomp.so with libgomp.so, the performance is back. But I don't think it's a good direction to replace just the OpenMP library with the manylinux one while keeping the rest from conda; that might cause problems somewhere else. So I'm planning to switch everything to manylinux and pip.

@snadampal
Contributor Author

Looking at how the wheel building scripts are integrated into the nightly wheel workflow, everything happens inside the manylinux docker image, and that image is missing many packages, including openblas.
By the way, even the older scripts use conda for those missing packages; the only difference is that they embed libgomp explicitly from the host instead of using the one from conda.
As @malfet hinted, packaging libgomp into the wheel might be the only way for now.

Any other thoughts?
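
For context on where the hashed names like libgomp-0f9e2209.so.1.0.0 come from: the wheel repair step copies whatever shared libraries the extension modules link against into the wheel and renames them with a hash suffix. A generic invocation looks roughly like this (not necessarily the exact mechanism the aarch64 builder scripts use):

# bundle external shared-library dependencies (including the OpenMP runtime) into the wheel
auditwheel repair dist/torch-*.whl -w wheelhouse/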

@malfet
Contributor

malfet commented Apr 5, 2024

@snadampal In the original script I build OpenBLAS from source, because the one that comes with the OS was lacking OpenMP integration. And the only reason to use conda was to install cmake and ninja, which were missing from PyPI at the time. Now one can (and should) completely eliminate the conda dependency for wheel builds.
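
For reference, building OpenBLAS with the OpenMP threading backend from source looks roughly like this (a minimal sketch; the CI scripts pin a specific version and pass additional target flags):

# build OpenBLAS with OpenMP support and install it into the build environment
git clone https://github.com/OpenMathLib/OpenBLAS.git
cd OpenBLAS
make -j"$(nproc)" USE_OPENMP=1
make install PREFIX=/usr/local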

@snadampal
Contributor Author

Right now the scripts are using the manylinux2014 OS, which is pretty old and comes with python 3.6; that's why we had to rely on conda for a lot of packages. Given that its EOL is June 30th, 2024 (link), I'm upgrading the docker image to manylinux_2_28, the latest one. This might remove the conda dependency; I'm trying it.
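
The manylinux_2_28 image is published for aarch64 on quay.io, so its default toolchain and Python versions can be checked directly (illustrative commands using the generic latest tag):

# pull the aarch64 manylinux_2_28 image and check its default compiler
docker pull quay.io/pypa/manylinux_2_28_aarch64
docker run --rm quay.io/pypa/manylinux_2_28_aarch64 gcc --version
# the CPython installations live under /opt/python
docker run --rm quay.io/pypa/manylinux_2_28_aarch64 ls /opt/python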

@snadampal
Contributor Author

It's becoming more involved than I initially thought. manylinux_2_28 comes with gcc-12, with which pytorch compilation is failing on aarch64.
@bryantbiggs, would you be able to give it a try?

@bryantbiggs
Contributor

Ya, I'll see if I can take a look - still working my way through building from source without conda for the CUDA-based build.

@snadampal
Contributor Author

snadampal commented Apr 8, 2024

I have upgraded the docker image to manylinux_2_28 and removed the conda dependency completely; everything is installed from manylinux or pypi. This solves the libomp performance issue.

here is the draft PR:
#1781

I had to disable building the pytorch tests via BUILD_TEST=0 to make it work; that's where the GCC-12 breakage comes from. I will look into it next.
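
The workaround boils down to something like this inside the build container (illustrative only; the actual CD workflow sets many more build variables):

# skip compiling the C++ test binaries, which is where the gcc-12 errors show up
cd pytorch
BUILD_TEST=0 python3 setup.py bdist_wheel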

@snadampal
Contributor Author

I have fixed the pytorch test build issue; in fact it seems to be a known issue, pytorch/pytorch#99278, and there was a PR too, pytorch/pytorch#99468.
It looks like it was observed on x86 as well, so it would be good if that PR could be merged.

With this PR, the torch build is working fine in the manylinux_2_28 docker image with the gcc-12 toolchain.

@snadampal
Contributor Author

For now, I'm using the gcc-11 toolchain on manylinux_2_28, so I'm not blocked on the PyTorch test build PR mentioned above.
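
One way to get a gcc-11 toolchain inside the manylinux_2_28 (AlmaLinux 8 based) container is the gcc-toolset software collection; this is an assumption about the approach, not necessarily what the CD scripts do:

# install and activate the gcc-11 software collection
dnf install -y gcc-toolset-11
source /opt/rh/gcc-toolset-11/enable
gcc --version    # should now report 11.x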

@snadampal
Contributor Author

I'm observing that, compared to the default llvm libomp used in the official torch wheels, gnu libgomp gives 2x better performance in torch.compile mode but 10% worse performance in eager mode.
Unless this 10% slowdown is root-caused, it's not good to switch to libgomp. I'm checking why the 10% drop happens and, more importantly, why libomp is 2x slower for torch.compile - is there some mixup in torch dynamo?
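
To separate the two effects, the torchbench command from the reproduction steps can be run with and without the inductor backend under each OpenMP runtime (same model and thread count as the earlier repro command; only the metrics differ):

# eager mode baseline
OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --metrics="latencies"
# torch.compile (inductor) mode
OMP_NUM_THREADS=16 python3 run_benchmark.py cpu --model hf_DistilBert --test eval --torchdynamo inductor --freeze_prepack_weights --metrics="latencies"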

@snadampal
Contributor Author

[Chart: PyTorch HuggingFace inference on aarch64 Linux - relative libomp performance vs. the libgomp baseline]

It looks like there is no clear winner between libgomp and libomp on aarch64 linux. We ran several benchmarks from huggingface and plotted libgomp performance as the reference blue line and the relative libomp performance as orange bars. From the chart it's clear that while libomp is better for a few models, libgomp is better in general, so we are going ahead with switching to libgomp for aarch64 linux, which has already shown up to 2x better performance for torch.compile.

@malfet malfet closed this as completed in b57d3a8 Apr 24, 2024
malfet pushed a commit that referenced this issue Apr 25, 2024
In the current version of the scripts, torch libraries are linked to llvm openmp because the conda openblas-openmp package is linked to it. To switch to gnu libgomp, we are building openblas from source instead of installing it from conda.

In essence it reverts #1462

fixes #1774

(cherry picked from commit b57d3a8)
snadampal added a commit to snadampal/builder that referenced this issue Apr 28, 2024
In the current version of the CD scripts, torch libraries are
linked to llvm openmp because conda openblas-openmp is linked to it.
To switch to gnu libgomp, we are building the openblas from sources
instead of installing from conda.
Building openBLAS shared library instead of static library to
be able to discover LAPACK support in OpenBLAS.

cherrypicked from pytorch#1803
fixes: pytorch#1774
atalman pushed a commit that referenced this issue May 8, 2024
In the current version of the CD scripts, torch libraries are
linked to llvm openmp because conda openblas-openmp is linked to it.
To switch to gnu libgomp, we are building the openblas from sources
instead of installing from conda.
Building openBLAS shared library instead of static library to
be able to discover LAPACK support in OpenBLAS.

cherrypicked from #1803
fixes: #1774