Tensorflow CPU vs GPU #68

mikegerber · 2020-02-25T10:22:49Z

https://github.com/OCR-D/ocrd_all#conflicting-requirements states that ocrd_calamari depends on tensorflow-gpu 1.14.x, but it has recently moved to 1.15.2.
There is also still some solvable(!) confusion about the different TensorFlow flavours. For tensorflow 1.15.*, one can simply depend on tensorflow-gpu == 1.15.* for CPU and GPU support. I am not aware of any issues using tensorflow-gpu's CPU fallback on CPU; I use it every day. (There was some additional confusion because TF changed their recommendation for 1.15 only.)
I just recently discovered that one can depend on an approximate version, e.g. tensorflow-gpu ~= 1.15.2 or tensorflow == 1.15.*
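
For anyone unfamiliar with the two specifier forms, here is a minimal sketch of how they match plain numeric versions such as 1.15.2. The helper names `parse`, `matches_prefix` and `matches_compatible` are made up for this illustration, and the logic is a simplification of PEP 440 (no pre-releases, post-releases or epochs). pip relies on the `packaging` library for the real rules.

```python
def parse(version):
    """Turn '1.15.2' into (1, 15, 2). Plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def matches_prefix(version, pattern):
    """Simplified '== 1.15.*': the version must start with the given prefix."""
    prefix = parse(pattern[: -len(".*")])
    return parse(version)[: len(prefix)] == prefix


def matches_compatible(version, minimum):
    """Simplified '~= 1.15.2': at least `minimum`, same leading parts except the last."""
    floor = parse(minimum)
    candidate = parse(version)
    return candidate >= floor and candidate[: len(floor) - 1] == floor[:-1]


# Compare both forms on a few versions.
for v in ["1.14.0", "1.15.0", "1.15.2", "1.15.5", "2.1.0"]:
    print(v, matches_prefix(v, "1.15.*"), matches_compatible(v, "1.15.2"))
```

In this sketch 1.15.0 satisfies `== 1.15.*` but not `~= 1.15.2`, because the latter also sets 1.15.2 as a lower bound.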

TL;DR: My recommendation would be that our TF1 projects just use tensorflow-gpu == 1.15.* for CPU and GPU support and be done with this problem.


bertsky · 2020-02-25T10:41:18Z

1. > https://github.com/OCR-D/ocrd_all#conflicting-requirements states that

Yes, that section needs to be updated (cf. #35). But the real problem is that TF2 dependencies are lurking everywhere, so we will very soon have the unacceptable state that no catch-all venv (satisfying both TF1 and TF2 modules) is possible anymore. By then, a new solution needs to be in place, which (at least partially) isolates venvs from each other again.

2. > For tensorflow 1.15.*, one can simply depend on `tensorflow-gpu == 1.15.*` _for CPU **and** GPU_ support. I am not aware of any issues using `tensorflow-gpu`'s CPU fallback on CPU

But isn't that equally true for using tensorflow == 1.15.*? It is the variant with a -gpu suffix that is going to be dropped eventually IIUC.

mikegerber · 2020-02-25T10:57:20Z

> > For tensorflow 1.15.*, one can simply depend on tensorflow-gpu == 1.15.* for CPU and GPU support. I am not aware of any issues using tensorflow-gpu's CPU fallback on CPU
>
> But isn't that equally true for using tensorflow == 1.15.*? It is the variant with a -gpu suffix that is going to be dropped eventually IIUC.

Nah, they had recommended tensorflow-gpu for TF2 CPU+GPU but changed it again to just tensorflow 🤣 So if tensorflow == 1.15.* has GPU support I am happy with that convention, too.

stweil · 2020-02-25T11:01:02Z

Is there a chance to upgrade everything to Tensorflow 2?

bertsky · 2020-02-25T11:05:37Z

> Is there a chance to upgrade everything to Tensorflow 2?

Code migration is not so difficult – yes, that could be streamlined in a coordinated PR effort. But IIRC the hard problem is that models will be incompatible and thus have to be retrained. Whether and when that is prudent is something the module providers have to decide for themselves. And it's highly unlikely the time frames will converge.

mikegerber · 2020-02-25T11:13:19Z

Of course there is a chance, it just involves quite a bit of work. For maintained software like ocrd_calamari:

1. Training a new model for a week (done)
2. Updating
3. Testing
4. Proper evaluation (no regression?)

This stuff (a) is not super high on priority lists because of effort vs. benefit, (b) takes time, and (c) sometimes depends on other software involved. ocrd_all will always have to deal with version conflicts.

And I imagine there are research projects that are no longer maintained, or that are maintained only by some poor PhD student with other priorities.

mikegerber · 2020-02-25T11:33:15Z

> But isn't that equally true for using tensorflow == 1.15.*?

I do not get GPU support with that, only CPU. With tensorflow-gpu == 1.15.* I have no issues. But I'll try again after lunch, to make sure.

stweil · 2020-02-25T11:41:38Z

> But IIRC the hard problem is that models will be incompatible and thus have to be retrained.

Maybe existing models can be converted, too?

mikegerber · 2020-02-25T13:37:55Z

> > But IIRC the hard problem is that models will be incompatible and thus have to be retrained.
>
> Maybe existing models can be converted, too?

In some cases this is possible. But not for e.g. Calamari 0.3.5 → 1.0, unless they support it.

mikegerber · 2020-02-25T16:41:56Z

> > But isn't that equally true for using tensorflow == 1.15.*?
>
> I do not get GPU support with that, only CPU. With tensorflow-gpu == 1.15.* I have no issues. But I'll try again after lunch, to make sure.

Alright, these are my results using the below script:

== tensorflow==1.15.*, CUDA_VISIBLE_DEVICES='0'
Already using interpreter /usr/bin/python3
2020-02-25 17:21:35.205395: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:21:35.220274: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:21:35.220640: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564f5da0e220 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:21:35.220655: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
GPU available: False
== tensorflow==1.15.*, CUDA_VISIBLE_DEVICES=''
Already using interpreter /usr/bin/python3
2020-02-25 17:21:55.577941: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:21:55.593243: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:21:55.593497: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5594505bb720 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:21:55.593532: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
GPU available: False
== tensorflow-gpu==1.15.*, CUDA_VISIBLE_DEVICES='0'
Already using interpreter /usr/bin/python3
2020-02-25 17:22:27.264675: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:22:27.281148: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:22:27.281383: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5f6815f70 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:22:27.281398: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-25 17:22:27.282909: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-25 17:22:27.424313: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b5f68a56b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:22:27.424336: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2020-02-25 17:22:27.424711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.86
pciBusID: 0000:01:00.0
2020-02-25 17:22:27.424872: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-02-25 17:22:27.425769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-02-25 17:22:27.426610: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-02-25 17:22:27.426867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-02-25 17:22:27.428707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-02-25 17:22:27.430106: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-02-25 17:22:27.433060: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-25 17:22:27.433717: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-02-25 17:22:27.433752: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-02-25 17:22:27.434268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-25 17:22:27.434279: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 
2020-02-25 17:22:27.434284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N 
2020-02-25 17:22:27.434897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 6786 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5)
GPU available: True
== tensorflow-gpu==1.15.*, CUDA_VISIBLE_DEVICES=''
Already using interpreter /usr/bin/python3
2020-02-25 17:22:58.971329: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-02-25 17:22:58.987226: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2799925000 Hz
2020-02-25 17:22:58.987497: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x558cc0be40d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-25 17:22:58.987526: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-25 17:22:58.989005: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-25 17:22:58.992375: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2020-02-25 17:22:58.992396: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: b-pc30533
2020-02-25 17:22:58.992402: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: b-pc30533
2020-02-25 17:22:58.992431: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.59.0
2020-02-25 17:22:58.992449: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.59.0
2020-02-25 17:22:58.992455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 440.59.0
GPU available: False

Script:

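#!/bin/sh
# For each TF1 package flavour, with and without a visible GPU,
# install into a fresh virtualenv and report whether TF sees a GPU.
for package in "tensorflow==1.15.*" "tensorflow-gpu==1.15.*"; do
  for CUDA_VISIBLE_DEVICES in "0" ""; do

    echo "== $package, CUDA_VISIBLE_DEVICES='$CUDA_VISIBLE_DEVICES'"

    export CUDA_VISIBLE_DEVICES

    # mktemp instead of $RANDOM, which plain POSIX sh does not provide
    venv=$(mktemp -d /tmp/tmp.XXXXXX)
    virtualenv --quiet -p /usr/bin/python3 "$venv"
    . "$venv/bin/activate"

    pip3 install --quiet --upgrade pip
    pip3 install --quiet "$package"

    python3 -c 'import tensorflow as tf; print("GPU available:", tf.test.is_gpu_available())'

    # leave the venv so the next iteration starts clean
    deactivate

  done
done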
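
The log above boils down to four True/False results. As a convenience, here is a small sketch that extracts them from output of this shape. The function name `summarize` and the regular expressions are assumptions made for this example, and the sample is an abbreviated excerpt of the log above.

```python
import re

# "== <package>, CUDA_VISIBLE_DEVICES='<value>'" as echoed by the script
HEADER = re.compile(r"^== (?P<package>\S+), CUDA_VISIBLE_DEVICES='(?P<devices>[^']*)'$")
# "GPU available: True|False" as printed by the Python one-liner
RESULT = re.compile(r"^GPU available: (?P<flag>True|False)$")


def summarize(log_text):
    """Map (package, CUDA_VISIBLE_DEVICES) to the reported GPU availability."""
    results = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = (header["package"], header["devices"])
            continue
        result = RESULT.match(line)
        if result and current is not None:
            results[current] = result["flag"] == "True"
    return results


sample = """\
== tensorflow==1.15.*, CUDA_VISIBLE_DEVICES='0'
GPU available: False
== tensorflow-gpu==1.15.*, CUDA_VISIBLE_DEVICES='0'
GPU available: True
"""
for (package, devices), gpu in summarize(sample).items():
    print(f"{package:<24} CUDA_VISIBLE_DEVICES={devices!r:<4} GPU: {gpu}")
```

Applied to the full log, this would report True only for tensorflow-gpu==1.15.* with CUDA_VISIBLE_DEVICES='0'.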

mikegerber · 2020-02-25T16:42:41Z

So, tensorflow-gpu==1.15.* is the right choice for TF1, it gives GPU and CPU support. (The script does not check for CPU support, I know that -gpu works for CPU too)

bertsky · 2020-02-26T13:10:08Z

> So, tensorflow-gpu==1.15.* is the right choice for TF1, it gives GPU and CPU support. (The script does not check for CPU support, I know that -gpu works for CPU too)

Indeed! We should open issues/PRs to all directly or indirectly affected module repos.

(Strange though, I have a clear memory of getting GPU support out of a tensorflow PyPI release. But maybe that was in an Nvidia Docker image, or TF 2.)

mikegerber · 2020-03-04T08:59:04Z

> (Strange though, I have a clear memory of getting GPU support out of a tensorflow PyPI release. But maybe that was in an Nvidia Docker image, or TF 2.)

Behaviour changed between releases, so that explains it:

https://web.archive.org/web/diff/20191015141958/20191208214348/https://www.tensorflow.org/install/pip

(Left: October 2019, right: February 2020)

stweil · 2020-04-23T08:17:57Z

> With tensorflow-gpu == 1.15.* I have no issues.

Bad news: With tensorflow-gpu==1.15.* I have issues because it does not work on macOS. tensorflow==1.15.* works fine there.
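
One conceivable way to express this platform split is sketched below, under the assumption that a module is free to pick its requirement per platform. The function name `tf1_requirement` and the list `MARKED_REQUIREMENTS` are hypothetical and not something ocrd_all or its modules are known to use; the marker syntax itself is the standard PEP 508 one.

```python
import sys


def tf1_requirement(platform=sys.platform):
    """Pick the TF1 requirement string for the given sys.platform value."""
    if platform == "darwin":
        # tensorflow-gpu 1.15 does not work on macOS, per the report above
        return "tensorflow == 1.15.*"
    return "tensorflow-gpu == 1.15.*"


# The same choice written as PEP 508 environment markers, as they could
# appear in a requirements.txt or in install_requires (illustrative only):
MARKED_REQUIREMENTS = [
    'tensorflow-gpu == 1.15.* ; sys_platform != "darwin"',
    'tensorflow == 1.15.* ; sys_platform == "darwin"',
]

print(tf1_requirement())
```

This only covers which TF1 package installs on which platform. It does nothing for TF1 and TF2 coexisting in one venv, since on macOS both would then claim the name tensorflow.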

bertsky · 2020-04-23T08:25:09Z

> > With tensorflow-gpu == 1.15.* I have no issues.
>
> Bad news: With tensorflow-gpu==1.15.* I have issues because it does not work on macOS. tensorflow==1.15.* works fine there.

These TF devs keep driving me mad. I thought we had this solved by now.

Okay, can you re-label the prebuilt tensorflow as tensorflow-gpu somehow?
Or should we build our own TF wheels under the correct name for macOS and include them in the supply chain?

stweil · 2020-04-23T08:59:05Z

> Okay, can you re-label the prebuilt tensorflow as tensorflow-gpu somehow?

Yes, that is possible. Of course there remains the conflict between TF1 and TF2, so the resulting installation won't work.

stweil · 2020-04-23T09:00:53Z

Building TF is a nightmare. It takes days for ARM, and I expect many hours for macOS.

bertsky · 2020-04-23T10:47:45Z

> > Okay, can you re-label the prebuilt tensorflow as tensorflow-gpu somehow?
>
> Yes, that is possible. Of course there remains the conflict between TF1 and TF2, so the resulting installation won't work.

I don't think this is the right approach. First of all, you don't discriminate the version you are delegating to. And second, this requires installing tensorflow from the same base version (which, yes, then makes it impossible to have both TF1 and TF2 installed at the same time).

I was thinking along the lines of modifying the name in the official wheel.
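
To make the idea concrete, here is a rough sketch of what such a wheel patch could look like. It is untested against the real TensorFlow wheels. The helpers `record_line` and `rename_wheel` are hypothetical names. The sketch assumes the wheel file name follows the standard `name-version-tags.whl` layout, that the `.dist-info` and `.data` directories use the same name and version as the file name, and that no path in the archive contains a comma. Signature files such as `RECORD.jws` are not handled.

```python
import base64
import hashlib
import re
import zipfile
from pathlib import Path


def record_line(path, data):
    """One RECORD row: path, urlsafe-base64 sha256 without padding, size."""
    digest = base64.urlsafe_b64encode(hashlib.sha256(data).digest())
    return f"{path},sha256={digest.rstrip(b'=').decode()},{len(data)}"


def rename_wheel(src, new_name):
    """Write a copy of wheel `src` under the distribution name `new_name`."""
    src = Path(src)
    old_dist, version, *tags = src.name.split("-")
    new_dist = new_name.replace("-", "_")  # wheel file names use underscores
    old_prefix = f"{old_dist}-{version}."
    new_prefix = f"{new_dist}-{version}."
    dst = src.with_name("-".join([new_dist, version, *tags]))

    records = []
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            if item.is_dir() or item.filename == old_prefix + "dist-info/RECORD":
                continue  # RECORD is regenerated below
            data = zin.read(item)
            path = item.filename
            # Move the <name>-<version>.dist-info/ and .data/ trees to the new name.
            if path.startswith((old_prefix + "dist-info/", old_prefix + "data/")):
                path = new_prefix + path[len(old_prefix):]
            if path == new_prefix + "dist-info/METADATA":
                # Replace only the first Name: header, not any text in the description.
                text = re.sub(r"^Name:.*$", "Name: " + new_name,
                              data.decode("utf-8"), count=1, flags=re.MULTILINE)
                data = text.encode("utf-8")
            info = zipfile.ZipInfo(path, date_time=item.date_time)
            info.external_attr = item.external_attr  # keep permission bits
            info.compress_type = zipfile.ZIP_DEFLATED
            zout.writestr(info, data)
            records.append(record_line(path, data))
        # RECORD lists itself without hash and size.
        records.append(new_prefix + "dist-info/RECORD,,")
        zout.writestr(new_prefix + "dist-info/RECORD", "\n".join(records) + "\n")
    return dst
```

A call such as `rename_wheel("tensorflow-1.15.2-<tags>.whl", "tensorflow-gpu")` would then produce a `tensorflow_gpu-1.15.2-<tags>.whl` next to the original, where `<tags>` stands for whatever tags the official wheel carries. Whether pip and TensorFlow itself are happy with the result on macOS would still need to be verified.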

bertsky · 2020-04-23T10:50:34Z

> Building TF is a nightmare. It takes days for ARM, and I expect many hours for macOS.

I know. And it never quite works out of the box as documented (at least for me). Too fast to die, too slow to live.

But building from scratch trivially gives you whatever package name you want. (So we could have tensorflow for TF2 and tensorflow-gpu for TF1 – even if it does not have actual GPU support on macOS.) But I am still more inclined to the wheel patching approach.

@kba your thoughts?

bertsky · 2020-08-20T21:06:08Z

So, except for ARM and macOS and Python 3.8 support (it just keeps growing) – which we should probably discuss in #147 – I think this has been solved by #118. @mikegerber can we close?

mikegerber mentioned this issue Feb 25, 2020

Allow building with thin module Docker containers #69

Open

mikegerber changed the title ~~Outdated info about conflicting requirements~~ Tensorflow CPU vs GPU Feb 25, 2020

kba added a commit to OCR-D/OLD_ocrd_anybaseocr that referenced this issue Mar 6, 2020

require tensorflow==1.15.*, cf. OCR-D/ocrd_all#68

958a41d

bertsky mentioned this issue Jun 26, 2020

group some modules into isolated venvs… #118

Merged

bertsky closed this as completed Aug 21, 2020
