The profiler is a little hard to use for distributed training since it gets enabled on all ranks. Every rank then writes to the same trace file, so the file gets overwritten and it's unclear which rank the profile came from.

For now, we can probably enable the profiler only on rank 0. We might lose some information, such as detecting stragglers on non-zero ranks, but I'm not particularly concerned about straggler issues for single-node use cases.
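A minimal sketch of the rank-0-only idea, assuming `torch.distributed` is already initialized; `train_step` and the `./profiler_out` directory are hypothetical placeholders, not existing torchtune API:

```python
import contextlib

import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler


def train_step():
    ...  # hypothetical stand-in for one iteration of the training loop


rank = dist.get_rank() if dist.is_initialized() else 0

if rank == 0:
    # Only rank 0 records and writes a trace, so the file is never overwritten.
    profiler = profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        on_trace_ready=tensorboard_trace_handler("./profiler_out"),
    )
else:
    # All other ranks get a no-op context manager.
    profiler = contextlib.nullcontext()

with profiler:
    train_step()
```

Every rank still runs the same training code; only rank 0 pays the profiling overhead and emits output.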
The torch profiler was added as an optional component in #627, and we showcase how to use it in the lora_finetune_single_device.py recipe, which doesn't have this issue. To address this, we have two options:

1. Add an additional showcase in one of the distributed recipes.
2. Move the torch profiler showcase to a distributed recipe, if we think a distributed recipe has more showcase value.

Either way, the distributed showcase would need rank-aware trace handling; see the sketch below.
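One rank-aware option that fixes the overwrite problem without giving up straggler detection is to profile every rank but tag each trace file with its rank. A hedged sketch (again, `train_step` and the output path are placeholders, not torchtune API):

```python
import os

import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile


def train_step():
    ...  # hypothetical stand-in for one iteration of the training loop


rank = dist.get_rank() if dist.is_initialized() else 0

# All ranks profile; each exports to its own file afterwards.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    train_step()

# Tagging the file name with the rank keeps traces from every rank,
# so slow ranks can still be spotted by comparing timelines.
os.makedirs("profiler_out", exist_ok=True)
prof.export_chrome_trace(f"profiler_out/trace_rank{rank}.json")
```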