try setting MAX_JOBS=4 for oom in arm wheel #1804

Closed
tinglvv wants to merge 33 commits

tinglvv commented Apr 26, 2024

The CUDA ARM wheel build is hitting an OOM error: https://github.com/pytorch/pytorch/actions/runs/8840652730/job/24276381274?pr=124112
Try setting MAX_JOBS=4 to limit the number of parallel build jobs.
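
As a rough illustration of the idea (not the actual change in this PR), a build script could cap parallelism by exporting `MAX_JOBS` into the environment passed to the build command. The helper name `build_env_with_job_cap` and its default of 4 are hypothetical; only the `MAX_JOBS` variable itself comes from the discussion.

```python
import os


def build_env_with_job_cap(max_jobs=4, base_env=None):
    """Return a copy of the environment with MAX_JOBS capped.

    Hypothetical helper: the name and defaults are assumptions for
    illustration, not code from aarch64_wheel_ci_build.py.
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    # Never ask for more jobs than there are CPUs, and never fewer than one.
    cpu_count = os.cpu_count() or 1
    env["MAX_JOBS"] = str(max(1, min(max_jobs, cpu_count)))
    return env


if __name__ == "__main__":
    env = build_env_with_job_cap(4)
    # The returned mapping would be passed to the build step, e.g.
    # subprocess.run(build_command, env=env, check=True)
    print("MAX_JOBS =", env["MAX_JOBS"])
```

Fewer parallel compile jobs means fewer compiler processes holding memory at once, at the cost of a longer build.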

tinglvv commented May 9, 2024

@pytorchbot rebase

nWEIdia left a comment

Looks great!

nWEIdia left a comment

Re-reviewing since we are having libopenblas.so test issues.

Review thread on aarch64_linux/aarch64_wheel_ci_build.py (resolved)

tinglvv commented May 17, 2024

@pytorchbot rebase

nWEIdia commented May 18, 2024

Please rebase so that the s390x errors will not show up: https://hud.pytorch.org/pytorch/pytorch/pull/126174

For the cuda test failures, we need to wait for ARM + CUDA instance availability: e.g. https://aws.amazon.com/ec2/instance-types/g5g/

cc @atalman @malfet @ptrblck

nWEIdia left a comment

A rebase is needed to fix some IBM (s390x) errors.
Otherwise, looks great!

tinglvv commented May 19, 2024

> Please rebase so that the s390x errors will not show up: https://hud.pytorch.org/pytorch/pytorch/pull/126174
>
> For the cuda test failures, we need to wait for ARM + CUDA instance availability: e.g. https://aws.amazon.com/ec2/instance-types/g5g/
>
> cc @atalman @malfet @ptrblck

Thanks for reviewing. I think what we need is for the SBSA NVIDIA driver 550.54.15 to be uploaded to AWS, rather than instance availability. I started https://github.com/pytorch/test-infra/pull/5218/files, to be merged once we upload the SBSA NVIDIA driver runfile to https://s3.amazonaws.com/ossci-linux/nvidia_driver/.

tinglvv commented May 19, 2024

@pytorchbot rebase

nWEIdia commented May 19, 2024

> > Please rebase so that the s390x errors will not show up: https://hud.pytorch.org/pytorch/pytorch/pull/126174
> > For the cuda test failures, we need to wait for ARM + CUDA instance availability: e.g. https://aws.amazon.com/ec2/instance-types/g5g/
> > cc @atalman @malfet @ptrblck
>
> Thanks for reviewing. I think we need the SBSA nvidia driver 550.54.15 to be uploaded to AWS instead of the instance availability. I started https://github.com/pytorch/test-infra/pull/5218/files to be merged once we upload the sbsa nvidia driver runfile to https://s3.amazonaws.com/ossci-linux/nvidia_driver/.

The error message was "RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx"

Wouldn't we need an NVIDIA GPU first, and only then install a driver? The M7G instance does not have an NVIDIA GPU.
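
To make the distinction concrete, here is a minimal sketch of a pre-flight check a CI job could run before attempting CUDA tests. It assumes that a working `nvidia-smi` on the runner is a reasonable proxy for "GPU plus driver present"; the function name `nvidia_driver_present` is hypothetical and is not part of pytorch/builder.

```python
import shutil
import subprocess


def nvidia_driver_present(smi_path=None):
    """Return True if `nvidia-smi -L` runs and exits successfully.

    Hypothetical helper for illustration. On a host with no NVIDIA GPU
    or no driver installed, this is expected to return False.
    """
    exe = smi_path or shutil.which("nvidia-smi")
    if exe is None:
        # nvidia-smi is not on PATH: treat the driver as missing.
        return False
    try:
        # `nvidia-smi -L` lists the GPUs the driver can see.
        result = subprocess.run([exe, "-L"], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if nvidia_driver_present():
        print("NVIDIA driver found; CUDA tests can run.")
    else:
        print("No NVIDIA driver found; skipping CUDA tests.")
```

A job could use a check like this to skip the CUDA tests instead of failing with the RuntimeError quoted above.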

nWEIdia commented May 19, 2024

And good catch: eventually we will need the SBSA NVIDIA driver 550.54.15 to be uploaded to AWS for the test to work.

nWEIdia commented May 19, 2024

The rebase command may not work on the pytorch/builder repo. A manual rebase is needed.

tinglvv commented May 20, 2024

Yes, we will need an ARM+CUDA instance; thanks for catching that.
The current instance type is t4g.2xlarge (https://aws.amazon.com/ec2/instance-types/t4/), which doesn't have a GPU. We can switch once the proper builder is available.

snadampal and others added 15 commits May 19, 2024 22:58
* Disable automatic building of s390x docker image

* Update docker image and build scripts for s390x

* Switch devtoolset to 13

There is a not yet investigated build failure
caused by gcc 12, but it doesn't reproduce
with gcc 13.

* Adapt binaries check for s390x

* Switch to ubuntu:24.04 for s390x

* Update libgomp.so.1 path for s390x
* Don't deactivate/remove conda on linux

* test
* Add manylinux_2_28 image
* Manylinux 2_28 fix cmake install

* fix

tinglvv commented May 20, 2024

Please ignore the above commits created by the rebase; I will resolve these later.
