
Multi-GPU training is much slower than single GPU (due to additional processes?) #19796

Open
RitaQian-westlake opened this issue Apr 22, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.1.x

Comments

@RitaQian-westlake

Bug description

I use DeepSpeed ZeRO Stage 2 Offload, as integrated in Lightning, to train my model. With a single GPU, one training epoch takes about 5 h; with two ranks, however, it takes about 12.5 h.
Something strange: with a single GPU, nvidia-smi shows only one process, whereas with 2-GPU training there are two processes on each rank.
With a large batch size the run reports a CUDA out-of-memory error and exits. On a single GPU the training simply terminates. With 2 GPUs, however, only one of the two processes on each rank exits; the remaining one keeps running, and training then continues normally at a much faster speed (2.5 h/epoch), which looks like the expected speed (half the single-GPU epoch time).
It looks like the additional process is what slows down multi-GPU training. Is its existence normal? If so, what is its function? Is there any way to avoid it? (Obviously I don't want to get rid of it by triggering a CUDA out-of-memory error on purpose.)
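
To check which process belongs to which rank, here is a minimal diagnostic sketch (the RankLogger callback below is illustrative, not part of my training code) that prints the PID and rank of every process when fit starts:

import os
import torch
from lightning.pytorch.callbacks import Callback  # or: from pytorch_lightning.callbacks import Callback

class RankLogger(Callback):
    # Print the PID and rank of every process at fit start, so the extra
    # process visible in nvidia-smi can be matched to a rank.
    def on_fit_start(self, trainer, pl_module):
        print(f"pid={os.getpid()} "
              f"global_rank={trainer.global_rank} "
              f"local_rank={trainer.local_rank} "
              f"cuda_device={torch.cuda.current_device()}")

Passing RankLogger() in the callbacks list of the Trainer below shows how many processes actually join training and which PID each rank has.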

What version are you seeing the problem on?

v2.1

How to reproduce the bug

from lightning.pytorch import Trainer  # or: from pytorch_lightning import Trainer

# The callbacks, logger, min_epoch and max_epoch below are defined elsewhere in my code.
trainer = Trainer(devices=[0, 1],  # for the single-GPU run: devices=1
                  accelerator="gpu",
                  callbacks=[epoch_end_callback, checkpoint_callback, earlystop_callback],
                  min_epochs=min_epoch,
                  max_epochs=max_epoch,
                  deterministic=True,
                  benchmark=True,
                  strategy='deepspeed_stage_2_offload',
                  logger=logger,
                  profiler='simple'
                  )

Optimizer: `deepspeed.ops.adam.DeepSpeedCPUAdam`
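
For completeness, a minimal sketch of how the optimizer is configured (MyModel, the Linear layer and the learning rate are illustrative placeholders, not my real model):

from deepspeed.ops.adam import DeepSpeedCPUAdam
import lightning.pytorch as pl  # or: import pytorch_lightning as pl
import torch.nn as nn

class MyModel(pl.LightningModule):
    # Illustrative module; the real model and hyperparameters differ.
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 32)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        # DeepSpeedCPUAdam keeps optimizer states on the CPU, matching
        # the deepspeed_stage_2_offload strategy.
        return DeepSpeedCPUAdam(self.parameters(), lr=1e-4)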

Error messages and logs

2-GPU: [screenshot of training log, not reproduced here]
single-GPU: [screenshot of training log, not reproduced here]

Environment

Current environment:
- Lightning Component: Trainer
- PyTorch Lightning Version: 2.1.3
- PyTorch Version: 2.0.1
- Python version: 3.10.12
- OS: CentOS Linux release 7.4.1708 (Core)
- CUDA/cuDNN version: CUDA 11.7 + cuDNN 8.5
- GPU models and configuration: 2 ranks in one A40 GPU
- How you installed Lightning (`conda`, `pip`, source): conda

More info

No response

@RitaQian-westlake added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Apr 22, 2024