Multi-Node DeviceStatsMonitor #19700
Unanswered
oabuhamdan
asked this question in
DDP / multi-GPU / multi-node
Greetings,
I run multi-node training on CPU with the torchrun command, across 3 nodes.
I notice that DeviceStatsMonitor only captures stats for the node with node_rank 0.
Is there a way to capture this data for all nodes?
I am open to a solution that involves writing my own callback.
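
For reference, here is a rough sketch of the kind of custom callback I have in mind, not a tested solution. It assumes that torchrun sets `LOCAL_RANK` and `GROUP_RANK` (the node rank) in each worker's environment, and it sidesteps the logger entirely by having one process per node append to its own CSV file. The names `PerNodeStatsCSV`, `collect_cpu_stats`, `out_dir`, and `every_n_steps` are made up for illustration, and the stats come from the standard library rather than from whatever DeviceStatsMonitor reports.

```python
import csv
import os
import socket
import time

try:
    from lightning.pytorch.callbacks import Callback
except ImportError:  # fallback so the sketch can be tried without Lightning
    Callback = object


def collect_cpu_stats():
    """Return a few host/process CPU stats using only the standard library."""
    try:
        load_1m, load_5m, load_15m = os.getloadavg()
    except (AttributeError, OSError):  # e.g. not available on Windows
        load_1m = load_5m = load_15m = None
    return {
        "process_cpu_seconds": time.process_time(),
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
    }


class PerNodeStatsCSV(Callback):
    """Hypothetical callback: one CSV file per node, written by local rank 0."""

    def __init__(self, out_dir="node_stats", every_n_steps=50):
        super().__init__()
        self.out_dir = out_dir
        self.every_n_steps = every_n_steps

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Assumption: torchrun exports LOCAL_RANK and GROUP_RANK.
        if int(os.environ.get("LOCAL_RANK", "0")) != 0:
            return  # only one process per node writes
        step = trainer.global_step
        if step % self.every_n_steps != 0:
            return

        node_rank = int(os.environ.get("GROUP_RANK", "0"))
        row = {
            "step": step,
            "node_rank": node_rank,
            "host": socket.gethostname(),
            **collect_cpu_stats(),
        }

        os.makedirs(self.out_dir, exist_ok=True)
        path = os.path.join(self.out_dir, f"node_{node_rank}.csv")
        write_header = not os.path.exists(path)
        with open(path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if write_header:
                writer.writeheader()
            writer.writerow(row)
```

It would be passed to the trainer as `Trainer(callbacks=[PerNodeStatsCSV(every_n_steps=50)])`. Writing a separate file per node avoids clashes if `out_dir` is on a shared filesystem; if it is not, each node simply keeps its own file locally.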