transformer_engine: Fails with Pythia-14m in distributed #368

Closed
kshitij12345 opened this issue May 6, 2024 · 0 comments · Fixed by #379
Labels: bug, TransformerEngine

kshitij12345 (Collaborator) commented May 6, 2024

Running

torchrun --nproc_per_node=2 --nnodes=1 thunder/benchmarks/benchmark_litgpt.py --return_metrics_as_json=True --json_path=/tmp/benchmark_litgpt_data.json --distributed_mode=fsdp --shard_mode=zero3 --model_name=pythia-14m --micro_batch_size=1 --compile=thunder_inductor_transformerengine --nsys_enabled=False --dynamic=False

leads to

An error occurred: UnboundLocalError – local variable 't7' referenced before assignment
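
The message says that a local name `t7` was read inside a function before any assignment to it had run. The report does not show where this happens, so the following is only a minimal Python sketch of this class of error. `generated_fn` and its body are hypothetical stand-ins, not Thunder's actual generated trace, and the exact message wording differs between Python versions.

```python
# Hypothetical stand-in for a generated function; NOT Thunder's real trace.
def generated_fn(x):
    t6 = x + 1
    t8 = t7 * 2  # t7 is read here ...
    t7 = t6 - 1  # ... but only assigned here, which makes it a local name
    return t8


def run():
    """Call the function and report the error it raises, if any."""
    try:
        generated_fn(1)
    except UnboundLocalError as e:
        return f"{type(e).__name__}: {e}"
    return "no error"


if __name__ == "__main__":
    print(run())
```

Because `t7` is assigned somewhere in the function body, Python treats it as local for the whole function, so the earlier read fails instead of falling back to a global lookup.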