Enable profiler only on rank 0 #885

Open

rohan-varma opened this issue Apr 26, 2024 · 1 comment

@rohan-varma
Member

The profiler is a little hard to use for distributed training since it gets enabled on all ranks. This results in the trace file being overwritten, and it's unclear which rank a given trace came from.

For now, we can probably enable the profiler only on rank 0. We might lose some information, such as the ability to detect stragglers on non-zero ranks, but I'm not particularly concerned about straggler issues for single-node use cases.
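
For reference, a minimal sketch of what rank-0 gating could look like. The helper name, output directory, and activity list are illustrative assumptions, not the actual recipe config:

```python
import contextlib

import torch.distributed as dist
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler


def rank_zero_profiler(output_dir: str = "./profiler_output"):
    # Hypothetical helper, for illustration only: return a real profiler
    # on rank 0 and a no-op context manager on every other rank.
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        return profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            on_trace_ready=tensorboard_trace_handler(output_dir),
        )
    return contextlib.nullcontext()


# Usage in a training loop; `prof` is None on non-zero ranks because
# nullcontext() yields None:
#
# with rank_zero_profiler() as prof:
#     for batch in dataloader:
#         ...  # forward/backward/optimizer step
#         if prof is not None:
#             prof.step()
```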

@rohan-varma rohan-varma self-assigned this Apr 26, 2024
@SLR722 SLR722 self-assigned this May 7, 2024
@SLR722
Contributor

SLR722 commented May 7, 2024

The torch profiler was added as an optional component in #627, and we showcase how to use it in the lora_finetune_single_device.py recipe, which won't have this issue. To address this, we have two options:

  1. Add an additional showcase in one of the distributed recipes
  2. Move the torch profiler showcase to a distributed recipe, if we think a distributed recipe has more showcase value

cc: @kartikayk
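
A note on the straggler tradeoff mentioned above: if a distributed recipe wanted traces from every rank without the files clobbering each other, `torch.profiler.tensorboard_trace_handler` accepts a `worker_name` argument, so per-rank output names are one possible alternative. This is a hedged sketch, not what either comment proposes; the directory name is illustrative:

```python
import torch.distributed as dist
from torch.profiler import tensorboard_trace_handler

# Tag each rank's trace with its own worker name so the files don't
# overwrite one another; "./profiler_output" is an assumed path.
rank = dist.get_rank() if dist.is_initialized() else 0
on_trace_ready = tensorboard_trace_handler(
    "./profiler_output",
    worker_name=f"rank{rank}",
)
```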
