
Long Compilation with DeepSpeed in the Cloud. #14

Open
tchaton opened this issue Nov 21, 2022 · 0 comments
Labels: bug (Something isn't working)

Comments

@tchaton
Contributor

tchaton commented Nov 21, 2022

When running dreambooth_component.py in the cloud with the following command:

lightning run app dreambooth_component.py --setup --cloud

It seems the script takes almost 20 minutes from the DeepSpeed compilation (13:55:16) to the start of training (14:13:28). Locally, this step is almost instant.

100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
100%|██████████| 51/51 [00:17<00:00,  2.98it/s]
100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
100%|██████████| 51/51 [00:17<00:00,  2.97it/s]
[root.finetuner.ws.0] 2022-11-21T13:54:42.019Z /content/venv/lib/python3.8/site-packages/lightning/pytorch/utilities/seed.py:48: LightningDeprecationWarning: `lightning.pytorch.utilities.seed.seed_everything` has been deprecated in v1.8.0 and will be removed in v1.10.0. Please use `lightning.lite.utilities.seed.seed_everything` instead.
[root.finetuner.ws.0] 2022-11-21T13:54:42.019Z   rank_zero_deprecation(
[root.finetuner.ws.0] 2022-11-21T13:54:42.019Z [rank: 0] Global seed set to 42
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Preparing the Model...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Using /home/zeus/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Creating extension directory /home/zeus/.cache/torch_extensions/py38_cu102/utils...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Emitting ninja build file /home/zeus/.cache/torch_extensions/py38_cu102/utils/build.ninja...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Building extension module utils...
[root.finetuner.ws.0] 2022-11-21T13:54:42.945Z Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[root.finetuner.ws.0] 2022-11-21T13:55:00.276Z [1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /content/venv/lib/python3.8/site-packages/torch/include -isystem /content/venv/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /content/venv/lib/python3.8/site-packages/torch/include/TH -isystem /content/venv/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /content/venv/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[root.finetuner.ws.0] 2022-11-21T13:55:00.682Z [2/2] c++ flatten_unflatten.o -shared -L/content/venv/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Loading extension module utils...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Time to load utils op: 18.031002283096313 seconds
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Rank: 0 partition count [1] and sizes[(859520964, False)] 
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Using /home/zeus/.cache/torch_extensions/py38_cu102 as PyTorch extensions root...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z No modifications detected for re-loaded extension module utils, skipping build step...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Loading extension module utils...
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Time to load utils op: 0.00046515464782714844 seconds
[root.finetuner.ws.0] 2022-11-21T13:55:13.269Z Preparing the Dataloaders...
[root.finetuner.ws.0] 2022-11-21T13:55:15.876Z 2022-11-21 13:55:15.876542: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
[root.finetuner.ws.0] 2022-11-21T13:55:15.876Z To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[root.finetuner.ws.0] 2022-11-21T13:55:16.797Z 2022-11-21 13:55:16.797812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
[root.finetuner.ws.0] 2022-11-21T13:55:16.798Z 2022-11-21 13:55:16.797913: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
[root.finetuner.ws.0] 2022-11-21T13:55:16.798Z 2022-11-21 13:55:16.797927: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
[root.finetuner.ws.0] 2022-11-21T14:13:28.206Z Step 1/450: 1.3167471885681152
[root.finetuner.ws.0] 2022-11-21T14:13:28.207Z Step 2/450: 0.07824122905731201
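If the slowdown really comes from the JIT build of the DeepSpeed utils op rather than something after it (the log shows the 18 s build finishing at 13:55:13, while training only starts at 14:13:28), one way to narrow it down could be to load the op explicitly before training and time that call on its own. A minimal sketch, assuming the DeepSpeed version in the cloud image still exposes UtilsBuilder (the builder behind the flatten_unflatten.cpp compile shown above); the MAX_JOBS override and the timing code are illustrative and not part of the original report:

import os
import time

# Hypothetical warm-up: build/load DeepSpeed's "utils" JIT extension before
# training so the ninja build seen in the log is paid up front and can be timed.
os.environ.setdefault("MAX_JOBS", "8")  # assumption: the machine has spare CPU cores for ninja

from deepspeed.ops.op_builder import UtilsBuilder  # builder behind flatten_unflatten.cpp

start = time.time()
utils_module = UtilsBuilder().load()  # same c++ build/load step shown in the log above
print(f"utils op ready in {time.time() - start:.1f}s")

Another option might be to pre-compile the op when the environment is built (DeepSpeed supports DS_BUILD_UTILS=1 at pip-install time), so no JIT build happens on the cloud machine at all; whether the Lightning cloud image allows that is an assumption.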
@tchaton added the bug label on Nov 21, 2022