Run Instruction

  1. Follow steps in README.md
  2. Run the training script as described in "2.2 Run this recipe" for T5.

If running on AzureML,

cd huggingface/script
python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model t5-large --run_config ort

If running locally,

cd huggingface/script
python hf-ort.py --hf_model t5-large --run_config ort --process_count <process_count> --local_run

Performance Comparison

| Run configuration | PyTorch (samples/sec) | ORTModule (samples/sec) | Gain |
| --- | --- | --- | --- |
| fp16 | 262.34 | 329.02 | 25.4% |
| fp16 with deepspeed stage 1 | 253.25 | 277.47 | 9.6% |
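
The Gain column is the relative throughput improvement of ORTModule over PyTorch. As a minimal sketch, it can be recomputed from the two samples/sec columns; the helper name `relative_gain` is illustrative and not part of the recipe.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Return the percentage improvement of candidate over baseline."""
    return (candidate / baseline - 1.0) * 100.0


# Throughput values (samples/sec) copied from the table: (PyTorch, ORTModule).
results = {
    "fp16": (262.34, 329.02),
    "fp16 with deepspeed stage 1": (253.25, 277.47),
}

for name, (pytorch, ortmodule) in results.items():
    print(f"{name}: {relative_gain(pytorch, ortmodule):.1f}%")
```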

These numbers are the average samples/sec over 10 runs on ND40rs_v2 VMs (V100 32G x 8) with CUDA 11, the stable release onnxruntime_training-1.8.0+cu111-cp36-cp36m-manylinux2014_x86_64.whl, and a batch size of 16. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check the Dockerfile for dependency details. We report the stable_train_samples_per_second metric from the log, which discards the first step because it includes setup time. Note that ORTModule takes some time for its initial setup, so with a small --max_steps value the total run time for ORTModule may be longer than for PyTorch. If you only want fine-tuning to finish sooner, you can still set --max_steps to a smaller value. Lastly, we do not recommend running this recipe on NC series VMs, which use the older K80 architecture.
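
The averaging described above could be sketched roughly as follows. The recipe does not specify the log layout, so this assumes the metric name appears on one line followed by its numeric value (for example `stable_train_samples_per_second = 329.02`); the function names and the sample log strings are hypothetical.

```python
import re
from statistics import mean

METRIC = "stable_train_samples_per_second"
# Assumed format: metric name, then separators such as " = " or '": ', then a number.
_PATTERN = re.compile(rf"{METRIC}[^\d\n]*?(\d+(?:\.\d+)?)")


def extract_metric(log_text):
    """Return the last reported metric value in one run's log, or None if absent."""
    values = _PATTERN.findall(log_text)
    return float(values[-1]) if values else None


def average_throughput(logs):
    """Average the metric over all run logs that report it."""
    values = [v for v in map(extract_metric, logs) if v is not None]
    if not values:
        raise ValueError(f"no {METRIC} values found")
    return mean(values)


# Hypothetical log excerpts from two runs, for illustration only.
sample_logs = [
    "epoch = 1.0\nstable_train_samples_per_second = 300.0\n",
    '{"stable_train_samples_per_second": 310.0}',
]
print(average_throughput(sample_logs))
```

In practice the list would hold the log text of each of the 10 runs; runs whose log lacks the metric are skipped rather than counted as zero.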

Convergence

Loss