The profiler is a little hard to use for distributed training since it gets enabled on all ranks. Every rank then writes to the same trace file, so the file gets overwritten and it's unclear which rank the profile came from.

For now, we can probably enable the profiler only on rank 0. We might lose some information, such as detecting stragglers on non-zero ranks, but I'm not particularly concerned about straggler issues for single-node use cases.
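A minimal sketch of the rank-0-only idea, assuming `torch.distributed` is already initialized; `train_step` and the `./profiler_out` directory are hypothetical placeholders, not existing torchtune API:

```python
import contextlib

import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler


def train_step():
    ...  # hypothetical stand-in for one iteration of the training loop


rank = dist.get_rank() if dist.is_initialized() else 0

if rank == 0:
    # Only rank 0 records and writes a trace, so the file is never overwritten.
    profiler = profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        on_trace_ready=tensorboard_trace_handler("./profiler_out"),
    )
else:
    # All other ranks get a no-op context manager.
    profiler = contextlib.nullcontext()

with profiler:
    train_step()
```

Every rank still runs the same training code; only rank 0 pays the profiling overhead and emits output.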
The torch profiler was added as an optional component in #627, and we showcase how to use it in the lora_finetune_single_device.py recipe, which doesn't have this issue. To address this, we have two options:

1. Add an additional showcase in one of the distributed recipes.
2. Move the torch profiler showcase to a distributed recipe, if we think a distributed recipe has more showcase value.

Either way, the distributed showcase would need rank-aware trace handling; see the sketch below.
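One rank-aware option that fixes the overwrite problem without giving up straggler detection is to profile every rank but tag each trace file with its rank. A hedged sketch (again, `train_step` and the output path are placeholders, not torchtune API):

```python
import os

import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile


def train_step():
    ...  # hypothetical stand-in for one iteration of the training loop


rank = dist.get_rank() if dist.is_initialized() else 0

# All ranks profile; each exports to its own file afterwards.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()

# Tagging the file name with the rank keeps traces from every rank,
# so slow ranks can still be spotted by comparing timelines.
os.makedirs("profiler_out", exist_ok=True)
prof.export_chrome_trace(f"profiler_out/trace_rank{rank}.json")
```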