Profiler does not Seem to Output Timesteps in xplane.pb - "No step marker observed and hence the step time is unknown" from Tensorboard #66410
Labels
comp:apis
Highlevel API related issues
comp:tensorboard
Tensorboard related issues
TF 2.15
For issues related to 2.15.x
type:bug
Bug
WIP
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
No
Source
source
TensorFlow version
2.15.0 (cuda120py39hb94c71b_3 from conda-forge)
Custom code
No
OS platform and distribution
Ubuntu Jammy in podman Container
Mobile device
No response
Python version
3.9.18
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
12.4 (cuda-cupti 12.4.127 @ h59595ed_1 from conda-forge)
GPU model and memory
RTX 3090, 24GiB
Current behavior?
I am in the process of writing a custom loss function, and trying to profile it to see where resources are currently used.
I have installed newer CUDA drivers, the latest release of TensorFlow (2.15), the CUDA PTI libraries, and other dependencies needed for the TensorBoard profiler plugin.
I can run an example with the profiler and get what looks like reasonable data. With my own code, I get a warning back from _pywrap_profiler.xspace_to_tools_data() that no timesteps are contained in the file, and thus some of the useful profiling information is absent or unusable. I cut back the example and found that if I use the MSE loss, the profile is complete; if I change to my own loss function, the timesteps are no longer output.
Given that the message is coming back from the core profiler library, and is contained within the encoded protocol buffers, I believe that this is an issue with the profiler and the main library, rather than the TensorBoard utility or the profiling plugin.
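For readers trying a similar comparison, here is a minimal sketch of profiling a few training steps with an explicit step marker. It is not the code from this report: `custom_loss` is a hypothetical stand-in, the model and data are toy placeholders, and the use of a custom training loop is an assumption. Whether explicit markers avoid the missing-step warning in this particular case is not established by the report.

```python
import tempfile

import tensorflow as tf


def custom_loss(y_true, y_pred):
    # Hypothetical stand-in loss; NOT the loss function from this report.
    return tf.reduce_mean(tf.abs(y_true - y_pred))


def profile_training(loss_fn, logdir, steps=5):
    """Run a few training steps under the profiler, marking each step."""
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
    # Toy data (assumption): 32 samples, 4 features, 1 target.
    x = tf.random.normal((32, 4))
    y = tf.random.normal((32, 1))

    tf.profiler.experimental.start(logdir)
    for step in range(steps):
        # Trace with step_num and _r=1 annotates this region as one step.
        with tf.profiler.experimental.Trace("train", step_num=step, _r=1):
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
    tf.profiler.experimental.stop()
    return float(loss)


if __name__ == "__main__":
    logdir = tempfile.mkdtemp()
    print(profile_training(custom_loss, logdir), logdir)
```

Running the same function once with `tf.keras.losses.MeanSquaredError()` and once with the custom loss, each into its own log directory, gives two profiles that can be compared side by side in TensorBoard.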
The loss function is reasonably complicated, so I suspected at first that large profile files might be the issue. However, after reducing down to the toy example above, whilst there is still a difference in file size, the fixed overhead in the files makes the sizes much closer:
I previously had warnings that the processor has dropped frames due to insufficient buffer space, but with this trivially small data size, those are gone.
Standalone code to reproduce the issue
Relevant log output