cuda-nvcc missing again #438

dhruvbalwada · 2023-02-01T19:13:44Z

It seems that the problem detected and solved in issue #387
has resurfaced. I think this happened after #435 was merged.

The problem:

A ptxas-based error shows up. It can be easily reproduced with:

from jax import random
random.PRNGKey(0)

which gives the error:

2023-02-01 19:08:39.849007: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:85] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2023-02-01 19:08:39.849939: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nvptx_compiler.cc:454] ptxas returned an error during compilation of ptx to sass: 'INTERNAL: Failed to launch ptxas'  If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Aborted

During the last discussion, @ngam asked which version of cuda-nvcc was installed. When I check this with

conda list | grep cuda-nvcc

This returns nothing, showing that there is no cuda-nvcc in the tensorflow/jax based ml-notebook.
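Because the XLA messages above say that ptxas could not be invoked, a quick complementary check is whether a ptxas executable is visible on PATH at all. The following is a minimal sketch using only the Python standard library; the helper name `find_ptxas` is hypothetical and is not part of JAX or conda.

```python
import shutil
from typing import Optional


def find_ptxas(path: Optional[str] = None) -> Optional[str]:
    """Return the full path of the ptxas executable, or None if it is not found.

    `path` uses the same format as the PATH environment variable.
    When it is None, the current PATH is searched.
    (Hypothetical helper for illustration only.)
    """
    return shutil.which("ptxas", path=path)


if __name__ == "__main__":
    location = find_ptxas()
    if location is None:
        print("ptxas not found on PATH")
    else:
        print(f"ptxas found at {location}")
```

If this reports that ptxas is not found inside the image, that is consistent with the empty `conda list | grep cuda-nvcc` result above.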

Installing cuda-nvcc by using mamba install cuda-nvcc==11.6.* -c nvidia solves the problem.

However, it would be good if the user did not have to do this installation manually, and the docker image was set up properly.

scottyhq · 2023-02-03T22:55:05Z

@dhruvbalwada I thought it was removed intentionally b/c no longer needed? See conversation here #398 ...

dhruvbalwada · 2023-02-03T22:57:44Z

Maybe @yuvipanda or @ngam or @weiji14 can chip in about why the problem has resurfaced?

ngam · 2023-02-04T06:03:08Z

It’s a complicated issue with all sorts of stuff. I think for now the best thing is to keep it out and let the user find a resolution. This is generally a tricky problem, and mismatches are bound to happen.

The good news is that cuda-nvcc is coming to conda-forge soon; the bad news is that it’ll be a while before the lengthy migration effort concludes.

Xref:

ngam · 2023-02-04T06:04:38Z

Btw, thanks @dhruvbalwada for keeping an eye on this, and for the detailed report :)

ngam · 2023-05-15T00:22:04Z

Small update: This is finally getting resolved... hopefully very soon! xref #450

weiji14 · 2023-06-27T02:55:09Z

Looks like cuda-nvcc is now on conda-forge - https://github.com/conda-forge/cuda-nvcc-feedstock. Is it better to install it directly in the ml-notebook image, or wait for the ML libraries like Tensorflow/Jax to depend on cuda-nvcc directly first? I see some mention of it e.g. at conda-forge/tensorflow-feedstock#296 (comment).

ngam · 2023-07-05T00:22:52Z

We should likely wait. I am still trying to assess how best to migrate Jax and TensorFlow to the new packaging format. We are in a bit of a bind here... with volunteer maintainers occupied with other tasks... but tensorflow 2.12 is very close and I am making small progress on jaxlib.

weiji14 · 2023-09-14T00:21:04Z

Someone reported on the forum at https://discourse.pangeo.io/t/how-to-run-code-using-gpu-on-pangeo-saying-libdevice-not-found-at-libdevice-10-bc/3672 about missing cuda-nvcc and XLA_FLAGS causing issues. Can we revisit adding cuda-nvcc to the docker image again, if the matter is resolved on conda-forge @ngam? @yuvipanda mentioned that 2i2c doesn't use the old K80 GPUs anymore, so we don't need to worry about backward compatibility if it helps.
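For readers hitting the libdevice error from that forum post, one commonly used workaround is to point XLA at the CUDA data directory through the `--xla_gpu_cuda_data_dir` flag in `XLA_FLAGS`. The sketch below only builds the flag string. It assumes that cuda-nvcc has been installed into the active conda environment and that its files live under `CONDA_PREFIX`; that layout is an assumption and may differ between channels and CUDA versions. The function name is hypothetical.

```python
import os
from typing import Mapping, Optional


def xla_flags_with_cuda_dir(env: Mapping[str, str]) -> Optional[str]:
    """Build an XLA_FLAGS value that includes --xla_gpu_cuda_data_dir.

    Assumption: the CUDA files are located under CONDA_PREFIX.
    Returns None when no conda environment is active.
    """
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return None
    existing = env.get("XLA_FLAGS", "").strip()
    # Leave an explicit user setting untouched.
    if "--xla_gpu_cuda_data_dir" in existing:
        return existing
    flag = f"--xla_gpu_cuda_data_dir={prefix}"
    return f"{existing} {flag}".strip()


if __name__ == "__main__":
    # This must run before JAX or TensorFlow is imported.
    flags = xla_flags_with_cuda_dir(os.environ)
    if flags is not None:
        os.environ["XLA_FLAGS"] = flags
```

Any flags already present in `XLA_FLAGS` are preserved, and an existing `--xla_gpu_cuda_data_dir` setting is not overridden.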

weiji14 · 2024-05-21T21:31:04Z

Quick note to say that jaxlib-0.4.23-cuda120py* actually has an explicit runtime dependency on cuda-nvcc now (see conda-forge/jaxlib-feedstock#241), but we'll need some more updates on tensorflow to resolve an incompatibility with libabseil versions. See #549 (comment), and keep an eye on conda-forge/tensorflow-feedstock#385.

Once those PRs are merged, users shouldn't have to install cuda-nvcc manually anymore, as it should be installed directly with jaxlib.
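To confirm which jaxlib build an image actually contains, and whether cuda-nvcc came along with it, the output of `conda list --json` can be inspected. This is a rough sketch, assuming the JSON is a list of package records with `name`, `version`, and `build_string` fields; the function name `package_builds` is hypothetical.

```python
import json
from typing import Dict, Iterable, Optional


def package_builds(
    conda_list_json: str,
    names: Iterable[str] = ("cuda-nvcc", "jaxlib"),
) -> Dict[str, Optional[str]]:
    """Map each requested package name to "version build_string".

    Packages that are not installed map to None.
    `conda_list_json` is the text printed by `conda list --json`.
    """
    found: Dict[str, Optional[str]] = {name: None for name in names}
    for pkg in json.loads(conda_list_json):
        name = pkg.get("name")
        if name in found:
            found[name] = f"{pkg.get('version')} {pkg.get('build_string')}"
    return found
```

The JSON text can be captured from `conda list --json` (or `mamba list --json`) inside the running image. A `None` entry for cuda-nvcc matches the situation in the original report, and a jaxlib build string starting with `cuda120` would indicate the build mentioned above.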

ngam mentioned this issue May 15, 2023

cuda issues with quay.io/pangeo/ml-notebook:2023.02.27 #450

Closed

weiji14 linked a pull request May 21, 2024 that will close this issue

Pin jaxlib to use cuda120 build with cuda-nvcc dependency #549

Draft
