Distributed view empty and no communication shown #782

Open
aamijar opened this issue Jul 15, 2023 · 4 comments
Labels
bug (Something isn't working), plugin (PyTorch Profiler TensorBoard Plugin related)

Comments

aamijar commented Jul 15, 2023

Hi, I am using the sample script resnet50_ddp_profiler.py from this repository: https://github.com/pytorch/kineto/blob/main/tb_plugin/examples/resnet50_ddp_profiler.py

Using

Python3.8
torch=2.0.1
torch-tb-profiler=0.4.3 # built from source

In TensorBoard, the Overview view shows communication as 0.
In the Distributed view:

  • no bar charts are shown for Synchronizing/Communication Overview.
  • the table at the bottom, Communication Operation stats, has 0 values in the columns total latency, avg latency, data transfer time, and avg data transfer time.

When I try using

Python3.8
torch=1.11.0
torch-tb-profiler=0.4.3 # built from source

With this setup there are no issues and the views show up properly.

However, with torch=1.12 and later, the communication and distributed views again do not show up properly.

Does anyone have any insight into why this may be the case?
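
For anyone trying to reproduce this, here is a minimal sketch for printing the installed versions alongside a report. It assumes the pip distribution names are `torch` and `torch-tb-profiler`; the helper name `installed_versions` is made up for illustration, and a build from source may report a different version string.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_versions(packages=("torch", "torch-tb-profiler")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Distribution is not installed in this environment.
            versions[name] = None
    return versions


print(installed_versions())
```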

aamijar commented Jul 15, 2023

I'm looking at the .json logs for both of these runs.

One observation: in the .json generated with torch=2.0.1, the objects named "ncclKernel_AllReduce_RING_LL_Sum_float(ncclDevComm*, unsigned long, ncclWork*)" have the same value in their External id and correlation fields, whereas in the .json generated with torch=1.11.0 these two fields have different values.

In torch=1.11.0, the External id of these objects also matches the External id of various other .json objects, with names such as cudaEventRecord, cudaLaunchKernel, etc.

This is not the case in the .json generated with torch=2.0.1.
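
For anyone who wants to run the same check on their own traces, the comparison above can be sketched roughly as follows. This is a sketch under assumptions, not a definitive tool: it assumes the events sit under a top-level "traceEvents" key (or that the file is a bare list of events) and that External id and correlation are stored in each event's "args" object. The helper names `load_events` and `inspect_nccl_events` are hypothetical.

```python
import json
from collections import defaultdict


def load_events(path):
    """Load the list of event objects from a profiler .json file."""
    with open(path) as f:
        data = json.load(f)
    # Assumption: events are under "traceEvents"; a bare list is also accepted.
    return data["traceEvents"] if isinstance(data, dict) else data


def inspect_nccl_events(events, prefix="ncclKernel"):
    """For each event whose name starts with `prefix`, report its External id,
    its correlation, whether the two are equal, and the names of other events
    that share the same External id."""
    names_by_external_id = defaultdict(set)
    for ev in events:
        ext = ev.get("args", {}).get("External id")
        if ext is not None:
            names_by_external_id[ext].add(ev.get("name", ""))

    rows = []
    for ev in events:
        name = ev.get("name", "")
        if not name.startswith(prefix):
            continue
        args = ev.get("args", {})
        ext = args.get("External id")
        corr = args.get("correlation")
        linked = set()
        if ext is not None:
            linked = names_by_external_id[ext] - {name}
        rows.append({
            "name": name,
            "external_id": ext,
            "correlation": corr,
            "same": ext is not None and ext == corr,
            "linked": sorted(linked),
        })
    return rows


# Usage (the path is a placeholder for one of your own trace files):
# for row in inspect_nccl_events(load_events("trace.json")):
#     print(row)
```

If a trace behaves like the torch=2.0.1 one described above, one would expect `same` to be true and `linked` to be empty for the NCCL kernel rows; for a trace like the torch=1.11.0 one, `linked` should list names such as cudaLaunchKernel.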

aaronenyeshi added the bug (Something isn't working) and plugin (PyTorch Profiler TensorBoard Plugin related) labels Jul 18, 2023

aamijar commented Jul 18, 2023

@aaronenyeshi Do you know of any way to resolve this, and are you able to reproduce the results above?

@aaronenyeshi
Member

Unfortunately, we lack the resources to fix tb_plugin bugs. Plans for it are still pending.

However, the OSS community is free to submit fixes for these issues via GitHub PRs.

@npuichigo

Any plan on this?
