ProfilerPluginLoader fails due to protobuf versions #609

Open
Inquisitive-ME opened this issue Apr 21, 2023 · 12 comments

Comments

@Inquisitive-ME

Using the latest versions available from pip, I get the following error:

E0421 08:52:53.103637 140219640803328 application.py:125] Failed to load plugin ProfilePluginLoader.load; ignoring it.
Traceback (most recent call last):
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard/backend/application.py", line 123, in TensorBoardWSGIApp
    plugin = loader.load(context)
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/profile_plugin_loader.py", line 75, in load
    from tensorboard_plugin_profile import profile_plugin
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 36, in <module>
    from tensorboard_plugin_profile.convert import raw_to_tool_data as convert
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/convert/raw_to_tool_data.py", line 29, in <module>
    from tensorboard_plugin_profile.convert import input_pipeline_proto_to_gviz
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/convert/input_pipeline_proto_to_gviz.py", line 28, in <module>
    from tensorboard_plugin_profile.protobuf import input_pipeline_pb2
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/protobuf/input_pipeline_pb2.py", line 17, in <module>
    from tensorboard_plugin_profile.protobuf import diagnostics_pb2 as plugin_dot_tensorboard__plugin__profile_dot_protobuf_dot_diagnostics__pb2
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/protobuf/diagnostics_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/google/protobuf/descriptor.py", line 561, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

It seems like the profiler plugin is incompatible with the latest tensorflow and tensorboard.
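
For anyone hitting the same traceback, the two workarounds named in the error message can be checked from Python. This is only a rough sketch: `newer_than_3_20` and `pure_python_env` are hypothetical helper names, not part of protobuf or tensorboard, and the version check assumes a plain `major.minor...` version string. The environment variable has to be in place before `google.protobuf` is first imported, which is why the sketch builds an environment for a child process instead of changing the current one.

```python
import os
from importlib import metadata


def newer_than_3_20(version: str) -> bool:
    """Return True if `version` is past the 3.20.x line named in the error.

    Assumes the version string starts with "major.minor", e.g. "4.22.3".
    """
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


def pure_python_env(base=None) -> dict:
    """Copy an environment and select protobuf's pure-Python implementation."""
    env = dict(os.environ if base is None else base)
    env["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return env


if __name__ == "__main__":
    try:
        installed = metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        print("protobuf is not installed in this environment")
    else:
        print("protobuf", installed, "newer than 3.20.x:", newer_than_3_20(installed))
```

The dict returned by `pure_python_env()` could be passed as `env=` to `subprocess.run(["tensorboard", "--logdir", logdir], env=...)`, where `logdir` is whatever directory you normally pass.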

@rdbis

rdbis commented Apr 28, 2023

same observation on my setup:
clean install of Ubuntu 22.04.2
tensorboard 2.12.2
tensorflow & CUDA installation according to the official tensorflow instructions: https://www.tensorflow.org/install/pip?hl=en#linux

setting PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python does not help either:
W0428 20:31:10.595216 139654082852416 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'

No profile data was found.

@marcosfelt

> same observation on my setup:
> clean install of Ubuntu 22.04.2
> tensorboard 2.12.2
> tensorflow & CUDA installation according to the official tensorflow instructions: https://www.tensorflow.org/install/pip?hl=en#linux
>
> setting PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python does not help either:
> W0428 20:31:10.595216 139654082852416 security_validator.py:60] In 3.0, this warning will become an error:
> Illegal Content-Security-Policy for script-src: 'unsafe-inline'
> Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
>
> No profile data was found.

I have the same issue!

@rdbis

rdbis commented Jun 4, 2023

OK, I think I found the root cause of the problem: a bug in the Bazel configuration files. All profiler protobuf stubs are generated with the ancient protobuf package (3.8.0), which makes them incompatible with the protobuf stubs from tensorboard/tensorflow, as those are generated with the newer protobuf package (>= 3.19.6). Tensorboard has an explicit dependency that loads protobuf 3.19.6 for stub generation. Such a dependency is missing from the Bazel configuration for the profiler; instead, it depends on tensorflow 2.1.0, which loads protobuf 3.8.0:
In tensorflow/workspace.bzl:
# 310ba5ee72661c081129eb878c1bbcec936b20f0 is based on 3.8.0 with a fix for protobuf.bzl.
PROTOBUF_URLS = [
    "https://storage.googleapis.com/mirror.tensorflow.org/github.com/protocolbuffers/protobuf/archive/310ba5ee72661c081129eb878c1bbcec936b20f0.tar.gz",
    "https://github.com/protocolbuffers/protobuf/archive/310ba5ee72661c081129eb878c1bbcec936b20f0.tar.gz",
]
PROTOBUF_SHA256 = "b9e92f9af8819bbbc514e2902aec860415b70209f31dfc8c4fa72515a5df9d59"
PROTOBUF_STRIP_PREFIX = "protobuf-310ba5ee72661c081129eb878c1bbcec936b20f0"

This makes the tensorflow profiler incompatible with all tensorboard/tensorflow releases based on protobuf >= 3.19.0.

@cliveverghese
Collaborator

#636 fixes this issue. You can verify that the change works by downloading tbp-nightly.

@rdbis

rdbis commented Jun 8, 2023

Thanks, it looks like the fix solves the protobuf compatibility problem. However, I still cannot see any profile data in the browser. The log from tensorboard/profiler shows the following:

NOTE: Using experimental fast data loading logic. To disable, pass
"--load_fast=false" and report issues on GitHub. More details:
tensorflow/tensorboard#4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.14.0a20230604 at http://localhost:6006/ (Press CTRL+C to quit)
W0608 18:27:59.310396 140681651652160 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:31:01.143541 140681441916480 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:33:49.505171 140681358022208 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:35:26.398867 140681358022208 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:35:32.885195 140681358022208 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:48:56.000323 140681525810752 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'

@cliveverghese
Collaborator

Hi,

Could you provide information regarding the version of the packages installed on your system?

I don't see a possible error condition in the logs provided. Do you see any errors in the browser console?
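
One way to collect the package versions asked for here is to query the installed distributions from Python. The snippet below is a minimal sketch: `installed_versions` is a hypothetical helper, and the distribution names in `PACKAGES` are an assumption based on the names mentioned in this thread (release and nightly variants), so adjust them to match your environment.

```python
from importlib import metadata

# Assumed pip distribution names (release and nightly variants).
PACKAGES = [
    "tensorflow",
    "tf-nightly",
    "tensorboard",
    "tb-nightly",
    "tensorboard-plugin-profile",
    "tbp-nightly",
    "protobuf",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter (for example, inside the same virtualenv) that launches tensorboard, otherwise the reported versions may not be the ones actually in use.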

@rdbis

rdbis commented Jun 9, 2023

Sure, I can recreate this problem with latest:
tf-nightly - 2.14.0.dev20230609
tb-nightly - 2.14.0a20230609
tbp-nightly - 2.14.0a20230609

log from tensorboard:
2023-06-09 21:37:04.990730: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-09 21:37:05.429546: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

NOTE: Using experimental fast data loading logic. To disable, pass
"--load_fast=false" and report issues on GitHub. More details:
tensorflow/tensorboard#4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.14.0a20230609 at http://localhost:6006/ (Press CTRL+C to quit)
W0609 21:50:52.116963 140076514244160 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'

here is tf execution log from my app:
2023-06-09 21:41:26.653283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19414 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-06-09 21:41:26.853218: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-06-09 21:41:26.853243: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2023-06-09 21:41:26.853269: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1671] Profiler found 1 GPUs
2023-06-09 21:41:26.859587: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-06-09 21:41:26.859639: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
Epoch 1/2
2023-06-09 21:41:28.795830: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8902
2023-06-09 21:41:29.981827: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-06-09 21:41:30.143119: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb678a86a50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-06-09 21:41:30.143143: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-06-09 21:41:30.155607: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-06-09 21:41:30.276619: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
499/924 [===============>..............] - ETA: 1:55 - loss: 397827.8438 - accuracy: 0.2960
2023-06-09 21:43:46.324071: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-06-09 21:43:46.324092: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
519/924 [===============>..............] - ETA: 1:49 - loss: 382497.2812 - accuracy: 0.2991
2023-06-09 21:43:52.028953: I tensorflow/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2023-06-09 21:43:52.030085: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
2023-06-09 21:43:52.047298: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_collector.cc:541] GpuTracer has collected 2475 callback api events and 2450 activity events.
2023-06-09 21:43:52.061061: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-06-09 21:43:52.066437: I tensorflow/tsl/profiler/rpc/client/save_profile.cc:144] Collecting XSpace to repository: /home/jozef/logs/20230609-214008/plugins/profile/2023_06_09_21_43_52/jozef-desktop.xplane.pb
924/924 [==============================] - 256s 273ms/step - loss: 1571177.1250 - accuracy: 0.3012
Epoch 2/2
924/924 [==============================] - 253s 274ms/step - loss: 28949560.0000 - accuracy: 0.2877

this is the list of files created in log directory during the program execution:
./plugins/profile/2023_06_09_21_43_52/jozef-desktop.xplane.pb
./train/events.out.tfevents.1686339687.jozef-desktop.6744.0.v2

In the browser, the profiler tab shows the message "No profile data was found."

@cliveverghese
Collaborator

What is the logdir specified when starting tensorboard?

@rdbis

rdbis commented Jun 9, 2023

tensorboard --logdir ~/logs

@cliveverghese
Collaborator

Seems like an issue with the logdir path. It should be /home/jozef/logs/20230609-214008. The tensorflow execution is receiving that as the logdir for the profiling request.

You could try running tensorboard --logdir /home/jozef/logs/20230609-214008
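
If you are not sure which directory to pass, the file listing earlier in the thread suggests the profile data sits under `<logdir>/plugins/profile/<timestamp>/<host>.xplane.pb`. Assuming that layout holds, a small script could search a parent directory and report the candidate logdirs. `find_profile_logdirs` is a hypothetical helper written for illustration, not part of tensorboard.

```python
from pathlib import Path


def find_profile_logdirs(root):
    """Return directories under `root` that contain plugins/profile/<run>/*.xplane.pb."""
    logdirs = set()
    for pb in Path(root).rglob("*.xplane.pb"):
        run_dir = pb.parent            # e.g. .../plugins/profile/2023_06_09_21_43_52
        profile_dir = run_dir.parent   # expected to be named "profile"
        plugins_dir = profile_dir.parent  # expected to be named "plugins"
        if profile_dir.name == "profile" and plugins_dir.name == "plugins":
            # The directory above "plugins" is the one to pass to --logdir.
            logdirs.add(plugins_dir.parent)
    return sorted(logdirs)
```

With the files listed above, calling `find_profile_logdirs("/home/jozef/logs")` should return the `20230609-214008` directory, which is the path suggested for `--logdir`.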

@rdbis

rdbis commented Jun 9, 2023

Wow, with tensorboard --logdir /home/jozef/logs/20230609-214008 it works like a charm. Thanks for this workaround. 👍
So it looks like there is an issue with how the logdir parameter is handled. Tensorboard properly lists all the collected profile runs in its browser interface, but selecting a specific one there does not work right now.
To make it work, the path to a specific profiler run must be provided as the logdir parameter to tensorboard, right? And tensorboard must be restarted with a new logdir parameter every time new profile data is collected?

@pritamdodeja

> Seems like an issue with the logdir path. It should be /home/jozef/logs/20230609-214008. The tensorflow execution is receiving that as the logdir for the profiling request.
>
> You could try running tensorboard --logdir /home/jozef/logs/20230609-214008

@cliveverghese I think this is indicative of a broader issue starting in 2.12 (possibly earlier), where the location of the profile data has changed. This is breaking the profiler in tensorboard. When I manually copy the files to the right place, the tensorboard profiler works as expected.
