
[BUG] - spawn by jupyterhub on K8s ; tensorflow doesn't recognize the GPU cards #1831

Closed
EajksEajks opened this issue Nov 16, 2022 · 3 comments
Labels
status:Need Info (We believe we need more information about an issue from the reporting user to help, debug, fix)
type:Bug (A problem with the definition of one of the docker images maintained here)

Comments

@EajksEajks

What docker image(s) are you using?

tensorflow-notebook

OS system and architecture running docker image

ubuntu 20.04 / amd64

What Docker command are you running?

Dell PowerEdge R740 w/ 2 Nvidia A30 GPU cards
Host OS = Ubuntu 20.04.5
Kubernetes Cluster = 1.25.3
jupyterhub for K8s = 2.0.0
tensorflow-notebook = 2022-11-15

The container is spawned by jupyterhub.

How to Reproduce the problem?

Spawn a server requesting access to 1 or 2 Nvidia A30 GPU cards.
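For context, the GPU request goes through the spawner configuration. A minimal sketch of the kind of KubeSpawner settings involved (nvidia.com/gpu is the standard device-plugin resource name; the exact values here are an illustration, not our precise helm config):

# jupyterhub_config.py (sketch only)
c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}
c.KubeSpawner.extra_resource_guarantees = {"nvidia.com/gpu": "1"}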

Under the notebook spawned by JupyterHub, in a terminal,
nvidia-smi lists the requested number of GPUs (1 or 2).

$ nvidia-smi
Wed Nov 16 17:25:05 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    27W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

In a notebook,

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

prints 0 available GPUs.
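For what it's worth, a quick sanity check of whether the TensorFlow build in the image supports CUDA at all (sketch; assumes TF >= 2.3 so that get_build_info is available):

import tensorflow as tf

# False here means the wheel in the image is CPU-only, so no GPU will ever be listed,
# regardless of what the host/driver setup looks like.
print("Built with CUDA:", tf.test.is_built_with_cuda())

# Build metadata: the CUDA/cuDNN versions the wheel was compiled against, if any
print(tf.sysconfig.get_build_info())

print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))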

Note that there is also another strange behavior: when I import tensorflow the first time, I get the following message, but when I import it again right away, it doesn't complain anymore.

2022-11-16 17:23:18.253523: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-16 17:23:18.314185: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.

Command output

No response

Expected behavior

No response

Actual behavior

TensorFlow doesn't recognize any GPU card, although nvidia-smi does.

Anything else?

No response

@EajksEajks added the type:Bug label Nov 16, 2022
@mathbunnyru
Member

Hi, @EajksEajks!
I have no experience with running GPUs in docker, but I will try to help.

  1. Could you please reproduce this behaviour without using JupyterHub/K8s and so on?
    A simple docker run command should be enough and would make this easier to debug; see the sketch below.
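Something along these lines should do (sketch; it assumes the NVIDIA container toolkit is installed on the host, which the --gpus flag relies on):

# run the image directly with GPU access, bypassing JupyterHub/K8s
docker run --rm -it --gpus all -p 8888:8888 jupyter/tensorflow-notebook:latest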

Under the notebook spawned by JupyterHub, in a terminal,
nvidia-smi lists the requested number of GPUs (1 or 2).

I don't think you're running this inside the container, because jupyter/tensorflow-notebook doesn't contain nvidia libraries.

Overall, I also don't think our images are designed to support GPUs properly.
At the very least, we don't install any NVIDIA drivers, either on the host machine or in the container.
https://www.howtogeek.com/devops/how-to-use-an-nvidia-gpu-with-docker-containers/
So I'm not sure GPUs are expected to work at all.
There is a separate project that tries to make this work; please take a look:
https://github.com/iot-salzburg/gpu-jupyter
https://hub.docker.com/r/cschranz/gpu-jupyter

Note that there is also another strange behavior: when I import tensorflow the first time, I get the following message, but when I import it again right away, it doesn't complain anymore.

This is just how Python works: if you import the same module a second time in the same process, Python doesn't re-execute it, so you only see the message on the first import.
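For illustration (nothing TensorFlow-specific here, just standard import caching):

import sys

import tensorflow as tf   # first import: module initialization runs, the log messages appear
import tensorflow as tf   # second import: Python returns the cached module from sys.modules, nothing re-runs

print("tensorflow" in sys.modules)   # True - the module object is cached for the lifetime of the process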

@mathbunnyru added the status:Need Info label Nov 16, 2022
@EajksEajks
Author

Hi, @mathbunnyru

I was expecting tensorflow-notebook to support GPU cards out of the box, as it is pretty inefficient to do machine learning without proper hardware. Moreover, I was misled by the JupyterHub installation instructions, which mention how to assign GPU cards to spawned notebooks.

On the K8s cluster we are running, NVIDIA's gpu-operator is installed and the GPUs are found without trouble, as typing !nvidia-smi in the notebook shows. Now I understand that the CUDA libs are simply not installed in the image :-) So I'll have a look at the projects you mention to find a way to get them installed.

Note that it's a pity that people have to take the source code of your image to build a new one with the CUDA libs installed. It would make much more sense for tensorflow-notebook to support GPU cards out of the box.

Thx for your help.

@mathbunnyru
Member

Note that it's a pity that people have to take the source code of your image to build a new one with the CUDA libs installed. It would make much more sense for tensorflow-notebook to support GPU cards out of the box.

I understand your frustration. The thing is, we're building a whole set of images, not just one.
We also have an issue which suggests adding images built on top of GPU-enabled base images:
#1557

I think it's actually possible and can be done in this project without hurting anyone, but I haven't yet seen a PR that tries to achieve it. I'm also not entirely sure what NVIDIA's license allows regarding how we can use their images.
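To make the idea concrete, here is a rough, untested sketch of what such a variant could look like (the CUDA/cuDNN pins are assumptions based on what current pip-installed TensorFlow wheels expect, not something we have validated):

# NOT an official image - sketch only
FROM jupyter/tensorflow-notebook:latest

# Add the CUDA runtime libraries the TensorFlow wheel looks for (assumed: CUDA 11.2 / cuDNN 8.1)
RUN mamba install --yes "cudatoolkit=11.2" "cudnn=8.1.*" && \
    mamba clean --all -f -y && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

# Help TensorFlow locate the conda-provided CUDA libraries at runtime
ENV LD_LIBRARY_PATH="${CONDA_DIR}/lib:${LD_LIBRARY_PATH}"

This would still require the NVIDIA driver and container runtime on the host, of course.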

For now, I think the easiest way is to just use the project I mentioned.
