FSDP / multi-GPU compatibility with EAI Eval Harness #951

Open
haileyschoelkopf opened this issue May 8, 2024 · 0 comments

haileyschoelkopf commented May 8, 2024

Describing here how multi-GPU currently works in lm-evaluation-harness, for integration purposes:

LM objects have the following properties, which are used in the main evaluate() loop:

Option 1: simple (slow) multi-GPU with HF

  • HF models with naive pipeline parallelism (only one GPU is active at a time; device_map='auto' in the transformers library). lm.rank = 0 and lm.world_size = 1 always: we launch only a single process, and it evaluates all instances, with the model split across GPUs.
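The rank/world-size semantics above can be sketched in plain Python. NaivePipelineLM here is a hypothetical illustration, not a real harness class; the point is only that with rank 0 and world size 1, a single process sees every instance:

```python
# Hypothetical sketch of Option 1 semantics: one process, model layers
# split across GPUs, so rank/world_size are fixed at their defaults.
class NaivePipelineLM:
    def __init__(self):
        self.rank = 0        # always 0: there is only a single process
        self.world_size = 1  # always 1: no data parallelism
        # In practice the model would be loaded with transformers'
        # device_map="auto", splitting layers across available GPUs.

lm = NaivePipelineLM()
instances = list(range(10))
# With world_size = 1, the strided per-rank slice is the full list:
mine = instances[lm.rank::lm.world_size]
assert mine == instances  # the single rank evaluates everything
```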

Option 2: Data Parallel only with HF

  • We launch using accelerate launch lm_eval --model hf ..., creating one process per data-parallel rank (one per GPU). Each rank evaluates a non-overlapping subset of instances, with lm.rank = {data-parallel rank} and lm.world_size = {num_gpus}.
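A minimal sketch of how non-overlapping subsets per rank can work. The strided split below is illustrative of the idea (each rank takes every world_size-th instance), not necessarily the harness's exact slicing logic:

```python
# Hypothetical sketch of Option 2: each of the N launched processes
# evaluates a disjoint slice of the instances, keyed by its rank.
def shard(instances, rank, world_size):
    # Strided split: rank r takes instances r, r + W, r + 2W, ...
    return instances[rank::world_size]

instances = list(range(7))
world_size = 2
shards = [shard(instances, r, world_size) for r in range(world_size)]
# The shards are disjoint and together cover every instance exactly once.
assert sorted(sum(shards, [])) == instances
```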

Option 3: Either TP+PP or DP-only with Nemo

The current constraints are basically:

  • We don't currently support mixing model parallelism/sharding with data replication during evaluation. This might be fixable by allowing duplicated ranks when creating requests.
  • We don't currently support multi-node. Once FSDP is integrated with your use case, though, this may come mostly for free, as long as ranks are set accurately across all devices used; happy to chat about this.

To integrate multi-node, you'll have to set up your distributed model in the LM subclass's initialization (or pass an already-initialized one into the LM constructor), set lm.rank and lm.world_size correctly, and implement the lm.accelerator methods described, roughly following what Nemo does. I intend to patch out the lm.accelerator accesses to make these unnecessary, though, since they basically just wrap torch distributed functions.

Let me know if anything here is confusing!
