Describing here how multi-GPU evaluation currently works in lm-evaluation-harness, for integration purposes:

LM objects have the following properties that get used in the main evaluate() loop:

- LM.rank: which data-parallel rank the current python process is on.
- LM.world_size: how many total data-parallel processes there are.
- LM.accelerator, which requires LM.accelerator.wait_for_everyone() and LM.accelerator.gather() methods. (These should be removable by us; see https://github.com/EleutherAI/lm-evaluation-harness/blob/885f48d62cb41589da4ab5aa9d0b6ace3cffb878/lm_eval/models/nemo_lm.py#L312-L325 for a workaround implemented by Nemo for now.)
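Very roughly, the evaluate() loop uses these properties as in the sketch below. This is simplified and not the actual evaluator code; loglikelihood stands in for whichever request type is being run, and the gather/wait semantics are only schematic:

```python
def evaluate_sketch(lm, requests):
    """Illustrative only: how lm.rank, lm.world_size and lm.accelerator get used."""
    # Each data-parallel rank takes a non-overlapping slice of the requests.
    my_requests = requests[lm.rank :: lm.world_size]

    # Stand-in for the LM's real request methods (loglikelihood, generate_until, ...).
    my_results = lm.loglikelihood(my_requests)

    if lm.world_size > 1:
        # Block until every rank has finished its share of the work...
        lm.accelerator.wait_for_everyone()
        # ...then collect all ranks' results so metrics cover the full request set.
        my_results = lm.accelerator.gather(my_results)
    return my_results
```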
Option 1: simple (slow) multi-GPU with HF

HF models with naive pipeline parallelism (only one GPU active at a time; device_map='auto' for the transformers library). lm.rank = 0 and lm.world_size = 1 always: we only launch a single process and it evaluates all instances, with the model split across GPUs.
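Concretely, this setup is just the standard sharded load on the transformers side (the model name below is only a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"  # placeholder checkpoint

# A single process drives the whole evaluation; transformers/accelerate splits
# the model's layers across all visible GPUs, so only one GPU works at a time.
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# From the harness's point of view there is no data parallelism here:
# lm.rank == 0 and lm.world_size == 1.
```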
Option 2: Data Parallel only with HF

We launch using accelerate launch lm_eval --model hf ..., creating 1 process per data-parallel rank (and per GPU). Each rank evaluates a non-overlapping subset of instances. lm.rank = {data parallel rank} and lm.world_size = {num_gpus}.
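For illustration, a minimal LM-style wrapper launched this way can pull its rank and world size from accelerate; the class and attribute names here are made up, and the real HF implementation differs in detail:

```python
from accelerate import Accelerator


class DataParallelLM:
    """Sketch of the DP-only case: one full model replica per GPU/process."""

    def __init__(self, model):
        accelerator = Accelerator()
        # Each of the N launched processes gets its own copy of the model.
        self.model = model.to(accelerator.device)
        # These back lm.rank / lm.world_size, which evaluate() uses to hand
        # each process a non-overlapping shard of the requests.
        self.rank = accelerator.process_index
        self.world_size = accelerator.num_processes
        self.accelerator = accelerator
```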
Option 3: Either TP+PP or DP-only with Nemo

Either run data parallel only -> 1 full model replica per GPU, with lm.rank = {data parallel rank} and lm.world_size = {num_gpus}, or run TP+PP with no data parallelism -> lm.rank = 0, lm.world_size = 1.
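Schematically, this case comes down to how lm.rank and lm.world_size get set. Here is a sketch using plain torch.distributed for illustration; a real integration would read the framework's own parallel state instead:

```python
import torch.distributed as dist


def eval_rank_and_world_size(data_parallel: bool) -> tuple[int, int]:
    """Return (rank, world_size) the way the harness expects them."""
    if data_parallel and dist.is_available() and dist.is_initialized():
        # DP-only: every GPU holds a full replica and evaluates its own shard.
        return dist.get_rank(), dist.get_world_size()
    # TP+PP without data parallelism: all GPUs cooperate on every request,
    # so the harness should see a single logical worker.
    return 0, 1
```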
The current constraints are basically:

- We don't currently support mixing model parallelism/sharding with data replication for evaluation. This might be fixable by allowing duplicated ranks when creating requests.
- We don't currently support multi-node. Once FSDP is integrated with your use case, though, this may come mostly for free, as long as ranks get set accurately across all devices used; happy to chat about this.

To integrate multi-node, you'll have to set up your distributed model in the LM subclass initialization (or pass an initialized one into the LM constructor), set lm.rank and lm.world_size correctly, and implement the lm.accelerator methods described above, roughly following what Nemo does. I intend to patch out the lm.accelerator accesses to make these unnecessary, though, since they basically just wrap torch distributed functions.
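For reference, something in the spirit of the Nemo workaround linked above: a minimal shim over torch.distributed that provides the two required methods (illustrative only, not the harness's or Nemo's exact code):

```python
import torch.distributed as dist


class DistributedShim:
    """Provides the two lm.accelerator methods the evaluate() loop relies on."""

    def wait_for_everyone(self):
        # Barrier across all ranks.
        if dist.is_available() and dist.is_initialized():
            dist.barrier()

    def gather(self, obj):
        # Collect each rank's python object; every rank gets the full list.
        if not (dist.is_available() and dist.is_initialized()):
            return [obj]
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, obj)
        return gathered
```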
Let me know if anything here is confusing!