
Multi-GPU training is much slower than single GPU (due to additional processes?) #19796

Open
RitaQian-westlake opened this issue Apr 22, 2024 · 0 comments
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.1.x

Comments

@RitaQian-westlake

Bug description

I use DeepSpeed ZeRO Stage 2 Offload, as integrated in Lightning, to train my model. With a single GPU, one training epoch takes about 5 h; with two ranks, however, it takes about 12.5 h.
Something strange: with a single GPU, nvidia-smi shows only one process, whereas with 2-GPU training there are two processes on each rank.
With a large batch size the run reports a CUDA out-of-memory error and exits. On a single GPU the training simply terminates. With 2 GPUs, however, only one of the two processes on each rank exits; the remaining one keeps running, and training then continues normally at a much faster speed (2.5 h/epoch), which looks like the expected speed (half the single-GPU epoch time).
It looks like the additional process is what slows down multi-GPU training. Is its existence normal? If so, what is its function? Is there any way to avoid it? (Obviously I don't want to get rid of it by triggering a CUDA out-of-memory error on purpose.)
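
To check which process belongs to which rank, here is a minimal diagnostic sketch (the RankLogger callback below is illustrative, not part of my training code) that prints the PID and rank of every process when fit starts:

import os
import torch
from lightning.pytorch.callbacks import Callback  # or: from pytorch_lightning.callbacks import Callback

class RankLogger(Callback):
    # Print the PID and rank of every process at fit start, so the extra
    # process visible in nvidia-smi can be matched to a rank.
    def on_fit_start(self, trainer, pl_module):
        print(f"pid={os.getpid()} "
              f"global_rank={trainer.global_rank} "
              f"local_rank={trainer.local_rank} "
              f"cuda_device={torch.cuda.current_device()}")

Passing RankLogger() in the callbacks list of the Trainer below shows how many processes actually join training and which PID each rank has.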

What version are you seeing the problem on?

v2.1

How to reproduce the bug

from lightning.pytorch import Trainer  # or: from pytorch_lightning import Trainer

# The callbacks, logger, min_epoch and max_epoch below are defined elsewhere in my code.
trainer = Trainer(devices=[0, 1],  # for the single-GPU run: devices=1
                  accelerator="gpu",
                  callbacks=[epoch_end_callback, checkpoint_callback, earlystop_callback],
                  min_epochs=min_epoch,
                  max_epochs=max_epoch,
                  deterministic=True,
                  benchmark=True,
                  strategy='deepspeed_stage_2_offload',
                  logger=logger,
                  profiler='simple'
                  )

Optimizer: `deepspeed.ops.adam.DeepSpeedCPUAdam`
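
For completeness, a minimal sketch of how the optimizer is configured (MyModel, the Linear layer and the learning rate are illustrative placeholders, not my real model):

from deepspeed.ops.adam import DeepSpeedCPUAdam
import lightning.pytorch as pl  # or: import pytorch_lightning as pl
import torch.nn as nn

class MyModel(pl.LightningModule):
    # Illustrative module; the real model and hyperparameters differ.
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 32)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def configure_optimizers(self):
        # DeepSpeedCPUAdam keeps optimizer states on the CPU, matching
        # the deepspeed_stage_2_offload strategy.
        return DeepSpeedCPUAdam(self.parameters(), lr=1e-4)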

Error messages and logs

2-GPU: [screenshot of training log, not reproduced here]
single-GPU: [screenshot of training log, not reproduced here]

Environment

Current environment:
- Lightning Component: Trainer
- PyTorch Lightning Version: 2.1.3
- PyTorch Version: 2.0.1
- Python version: 3.10.12
- OS: CentOS Linux release 7.4.1708 (Core)
- CUDA/cuDNN version: CUDA 11.7 + cuDNN 8.5
- GPU models and configuration: 2 ranks in one A40 GPU
- How you installed Lightning (`conda`, `pip`, source): conda

More info

No response

@RitaQian-westlake added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Apr 22, 2024