
Long Compilation with DeepSpeed in the Cloud. #14

Open
tchaton opened this issue Nov 21, 2022 · 0 comments
Labels: bug (Something isn't working)

Comments

@tchaton
Contributor

tchaton commented Nov 21, 2022

When running dreambooth_component.py in the cloud with the following command:

lightning run app dreambooth_component.py --setup --cloud

It seems the script takes almost 20 minutes from the DeepSpeed compilation (13:55:16) to the start of training (14:13:28). Locally, this step is almost instant.

100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
100%|██████████| 51/51 [00:17<00:00,  2.98it/s]
100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
[root.finetuner.ws.0] 2022-11-21T13:54:42.019Z /content/venv/lib/python3.8/site-packages/lightning/pytorch/utilities/seed.py:48: LightningDeprecationWarning: `lightning.pytorch.utilities.seed.seed_everything` has been deprecated in v1.8.0 and will be removed in v1.10.0. Please use `lightning.lite.utilities.seed.seed_everything` instead.
[root.finetuner.ws.0] 2022-11-21T13:54:42.019Z   rank_zero_deprecation(
[root.finetuner.ws.0] 2022-11-21T13:54:42.019Z [rank: 0] Global seed set to 42
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Preparing the Model...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Using /home/zeus/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Creating extension directory /home/zeus/.cache/torch_extensions/py38_cu102/utils...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Emitting ninja build file /home/zeus/.cache/torch_extensions/py38_cu102/utils/build.ninja...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Building extension module utils...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[root.finetuner.ws.0] 2022-11-21T13:55:00.276Z [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /content/venv/lib/python3.8/site-packages/torch/include -isystem /content/venv/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /content/venv/lib/python3.8/site-packages/torch/include/TH -isystem /content/venv/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /content/venv/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[root.finetuner.ws.0] 2022-11-21T13:55:00.682Z [2/2] c++ flatten_unflatten.o -shared -L/content/venv/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Loading extension module utils...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Time to load utils op: 18.031002283096313 seconds
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Rank: 0 partition count [1] and sizes[(859520964, False)] 
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Using /home/zeus/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z No modifications detected for re-loaded extension module utils, skipping build step...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Loading extension module utils...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Time to load utils op: 0.00046515464782714844 seconds
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Preparing the Dataloaders...
[root.finetuner.ws.0] 2022-11-21T13:55:15.876Z 2022-11-21 13:55:15.876542: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[root.finetuner.ws.0] 2022-11-21T13:55:15.876Z To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[root.finetuner.ws.0] 2022-11-21T13:55:16.797Z 2022-11-21 13:55:16.797812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
[root.finetuner.ws.0] 2022-11-21T13:55:16.798Z 2022-11-21 13:55:16.797913: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
[root.finetuner.ws.0] 2022-11-21T13:55:16.798Z 2022-11-21 13:55:16.797927: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[root.finetuner.ws.0] 2022-11-21T14:13:28.206Z Step 1/450: 1.3167471885681152
[root.finetuner.ws.0] 2022-11-21T14:13:28.207Z Step 2/450: 0.07824122905731201
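If the slowdown really comes from the JIT build of the DeepSpeed utils op rather than something after it (the log shows the 18 s build finishing at 13:55:13, while training only starts at 14:13:28), one way to narrow it down could be to load the op explicitly before training and time that call on its own. A minimal sketch, assuming the DeepSpeed version in the cloud image still exposes UtilsBuilder (the builder behind the flatten_unflatten.cpp compile shown above); the MAX_JOBS override and the timing code are illustrative and not part of the original report:

import os
import time

# Hypothetical warm-up: build/load DeepSpeed's "utils" JIT extension before
# training so the ninja build seen in the log is paid up front and can be timed.
os.environ.setdefault("MAX_JOBS", "8")  # assumption: the machine has spare CPU cores for ninja

from deepspeed.ops.op_builder import UtilsBuilder  # builder behind flatten_unflatten.cpp

start = time.time()
utils_module = UtilsBuilder().load()  # same c++ build/load step shown in the log above
print(f"utils op ready in {time.time() - start:.1f}s")

Another option might be to pre-compile the op when the environment is built (DeepSpeed supports DS_BUILD_UTILS=1 at pip-install time), so no JIT build happens on the cloud machine at all; whether the Lightning cloud image allows that is an assumption.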
@tchaton added the bug label on Nov 21, 2022