Describing here how multi-GPU evaluation currently works in lm-evaluation-harness, for integration purposes:

LM objects have the following properties that get used in the main evaluate() loop:

- LM.rank: which data-parallel rank the current python process is on.
- LM.world_size: how many total data-parallel processes there are.
- LM.accelerator, which requires LM.accelerator.wait_for_everyone() and LM.accelerator.gather() methods. (These should be removable by us; see https://github.com/EleutherAI/lm-evaluation-harness/blob/885f48d62cb41589da4ab5aa9d0b6ace3cffb878/lm_eval/models/nemo_lm.py#L312-L325 for a workaround implemented by Nemo for now.)
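Very roughly, the evaluate() loop uses these properties as in the sketch below. This is simplified and not the actual evaluator code; loglikelihood stands in for whichever request type is being run, and the gather/wait semantics are only schematic:

```python
def evaluate_sketch(lm, requests):
    """Illustrative only: how lm.rank, lm.world_size and lm.accelerator get used."""
    # Each data-parallel rank takes a non-overlapping slice of the requests.
    my_requests = requests[lm.rank :: lm.world_size]

    # Stand-in for the LM's real request methods (loglikelihood, generate_until, ...).
    my_results = lm.loglikelihood(my_requests)

    if lm.world_size > 1:
        # Block until every rank has finished its share of the work...
        lm.accelerator.wait_for_everyone()
        # ...then collect all ranks' results so metrics cover the full request set.
        my_results = lm.accelerator.gather(my_results)
    return my_results
```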
Option 1: simple (slow) multi-GPU with HF

HF models with naive pipeline parallelism (only one GPU active at a time; device_map='auto' for the transformers library). lm.rank = 0 and lm.world_size = 1 always: we only launch a single process and it evaluates all instances, with the model split across GPUs.
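Concretely, this setup is just the standard sharded load on the transformers side (the model name below is only a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"  # placeholder checkpoint

# A single process drives the whole evaluation; transformers/accelerate splits
# the model's layers across all visible GPUs, so only one GPU works at a time.
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# From the harness's point of view there is no data parallelism here:
# lm.rank == 0 and lm.world_size == 1.
```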
Option 2: Data Parallel only with HF

We launch using accelerate launch lm_eval --model hf ..., creating 1 process per data-parallel rank (and per GPU). Each rank evaluates a non-overlapping subset of instances. lm.rank = {data parallel rank} and lm.world_size = {num_gpus}.
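For illustration, a minimal LM-style wrapper launched this way can pull its rank and world size from accelerate; the class and attribute names here are made up, and the real HF implementation differs in detail:

```python
from accelerate import Accelerator


class DataParallelLM:
    """Sketch of the DP-only case: one full model replica per GPU/process."""

    def __init__(self, model):
        accelerator = Accelerator()
        # Each of the N launched processes gets its own copy of the model.
        self.model = model.to(accelerator.device)
        # These back lm.rank / lm.world_size, which evaluate() uses to hand
        # each process a non-overlapping shard of the requests.
        self.rank = accelerator.process_index
        self.world_size = accelerator.num_processes
        self.accelerator = accelerator
```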
Option 3: Either TP+PP or DP-only with Nemo

Either run data parallel only -> 1 full model replica per GPU, with lm.rank = {data parallel rank} and lm.world_size = {num_gpus}, or run TP+PP with no data parallelism -> lm.rank = 0, lm.world_size = 1.
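Schematically, this case comes down to how lm.rank and lm.world_size get set. Here is a sketch using plain torch.distributed for illustration; a real integration would read the framework's own parallel state instead:

```python
import torch.distributed as dist


def eval_rank_and_world_size(data_parallel: bool) -> tuple[int, int]:
    """Return (rank, world_size) the way the harness expects them."""
    if data_parallel and dist.is_available() and dist.is_initialized():
        # DP-only: every GPU holds a full replica and evaluates its own shard.
        return dist.get_rank(), dist.get_world_size()
    # TP+PP without data parallelism: all GPUs cooperate on every request,
    # so the harness should see a single logical worker.
    return 0, 1
```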
The current constraints are basically:

- We don't currently support mixing model parallelism/sharding with data replication for evaluation. This might be fixable by allowing duplicated ranks when creating requests.
- We don't currently support multi-node. Once FSDP is integrated with your use case, though, this may come mostly for free, as long as ranks get set accurately across all devices used; happy to chat about this.

To integrate multi-node, you'll have to set up your distributed model in the LM subclass initialization (or pass an initialized one into the LM constructor), set lm.rank and lm.world_size correctly, and implement the lm.accelerator methods described above, roughly following what Nemo does. I intend to patch out the lm.accelerator accesses to make these unnecessary, though, since they basically just wrap torch distributed functions.
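For reference, something in the spirit of the Nemo workaround linked above: a minimal shim over torch.distributed that provides the two required methods (illustrative only, not the harness's or Nemo's exact code):

```python
import torch.distributed as dist


class DistributedShim:
    """Provides the two lm.accelerator methods the evaluate() loop relies on."""

    def wait_for_everyone(self):
        # Barrier across all ranks.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()

    def gather(self, obj):
        # Collect each rank's python object; every rank gets the full list.
        if not (dist.is_available() and dist.is_initialized()):
            return [obj]
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, obj)
        return gathered
```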
Let me know if anything here is confusing!