
Optimizing Multi-task Training through Dynamic Pipelines

Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines (Paper).

During multi-task training, the model commonly receives input sequences of widely varying lengths due to the diverse contexts of different tasks. Padding (all sequences to the same length) or packing (short examples into long sequences of the same length) is usually adopted to prepare input samples for model training, but neither is space- or computation-efficient. This project adopts a dynamic micro-batching approach to tackle sequence length variation: each input global batch is split into multiple variable-length micro-batches, each of which comprises a (potentially different) number of samples of similar sequence lengths. These micro-batches are then organized into pipelines, enabling efficient 3D-parallel (data, tensor and pipeline) multi-task model training.
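
To make the idea concrete, below is a minimal, greedy sketch of length-aware micro-batch splitting. It is an illustration only: the project itself uses a dynamic programming algorithm with cost models to compute the optimal split, and the function name and token budget below are purely hypothetical.

# Conceptual sketch only: groups samples of similar sequence lengths into
# variable-sized micro-batches. DynaPipe's actual splitting uses dynamic
# programming and cost models; this greedy version just illustrates the idea.
from typing import List

def split_into_micro_batches(seq_lengths: List[int],
                             max_tokens_per_micro_batch: int = 4096) -> List[List[int]]:
    # Sort sample indices by sequence length so that each micro-batch holds
    # samples of similar lengths, which minimizes padding within the batch.
    order = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i])
    micro_batches, current = [], []
    for idx in order:
        # The padded size of a micro-batch is governed by its longest sample.
        longest = max([seq_lengths[i] for i in current] + [seq_lengths[idx]])
        if current and longest * (len(current) + 1) > max_tokens_per_micro_batch:
            micro_batches.append(current)
            current = []
        current.append(idx)
    if current:
        micro_batches.append(current)
    return micro_batches

# Example: samples of lengths 32..1024 end up in two variable-sized micro-batches.
print(split_into_micro_batches([32, 1024, 48, 980, 64, 512]))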

Main features of this project include:

  • An efficient dynamic programming algorithm to compute the optimal micro-batching plan for each input global batch.
  • A pipeline schedule robust to variable-sized micro-batches, minimizing pipeline bubbles.
  • A pipeline executor supporting highly dynamic pipelines (the pipeline schedule, the size and number of micro-batches can vary each iteration), based on an instruction-based abstraction of pipeline operations.
  • Execution plan generation overlapped with model training.

System Diagram


Getting Started

Dependencies

Redis

The distributed instruction store uses Redis as the underlying key-value store. A Redis server needs to be installed on every machine participating in training; our code will set up and initialize the Redis server automatically.

Note: The Redis server is not protected by authentication and may pose security risks. Please make sure that the code is only run in a secure environment.

Python Dependencies

Please see requirements.txt for the required Python packages. Install them by running

pip3 install -r requirements.txt

Installation

Clone this repository and run

pip3 install -e .

Then, build the C++ extensions by running

cd dynapipe/data_opt
make
cd ../memory_opt
python3 setup.py build

Pipeline Instructions

To use this project, the Pipeline Instructions (defined here) need to be implemented in the intended training framework (e.g., Megatron-LM). A reference implementation of the instructions in Megatron-LM can be found here.
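
For illustration, an instruction implementation on the framework side is essentially a function that performs one pipeline operation for a given micro-batch. The sketch below is hypothetical: the instruction type, its fields, and the state passed in are assumptions rather than the actual interface; refer to the instruction definitions and the Megatron-LM reference implementation for the real one.

# Hypothetical sketch of implementing a forward-pass pipeline instruction in a
# training framework. Names and fields are illustrative only; the actual
# instruction classes are defined in dynapipe/pipe.
def handle_forward_pass(instruction, state):
    # Look up the micro-batch this instruction refers to and run the
    # framework's forward step on it.
    micro_batch = state["micro_batches"][instruction.microbatch_id]
    output = state["model"](micro_batch)
    # Keep the output around so the matching backward instruction can use it.
    state["outputs"][instruction.microbatch_id] = output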

Using this project

Please note that this project is experimental and has only been tested with Megatron-LM integration (please refer to the linked repository for detailed usage).

This project interacts with the training framework mainly through the following two interfaces:

Data Loader

We wrap the micro-batch splitting and execution plan generation process into a DynaPipeDataLoader. It takes the normal PyTorch data loader arguments along with a few additional ones. Please see here for the full list of arguments. The returned iterator generates a tuple of micro-batched data and the corresponding execution plan for each iteration. This iterator is to be consumed by the pipeline executor. See here for an example of using the DynaPipeDataLoader in Megatron-LM.
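
A minimal usage sketch follows. The import path and the DynaPipe-specific keyword arguments are assumptions (only the standard PyTorch arguments are shown literally); consult the argument list linked above for the actual interface.

# Sketch of constructing and iterating the DynaPipeDataLoader. The import path
# and the DynaPipe-specific keyword arguments are placeholders.
from dynapipe.pipe.data_loader import DynaPipeDataLoader  # actual path may differ

data_loader = DynaPipeDataLoader(
    dataset,                        # standard PyTorch data loader arguments
    batch_size=global_batch_size,
    collate_fn=collate_fn,
    # ... plus the additional DynaPipe-specific arguments (see the full list)
)

for micro_batches, execution_plan in data_loader:
    # Each iteration yields the micro-batched data together with the execution
    # plan that the pipeline executor will run (see the next subsection).
    ...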

Pipeline Executor

The pipeline executor simply reads in execution plans and calls the Pipeline Instruction implementations. These implementations are registered with the executor through the register_handler function. To run the pipeline executor, simply call the execute function with the corresponding execution plan in each iteration. See here for an example of using the pipeline executor in Megatron-LM.
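
Putting the two interfaces together, the training loop might look roughly like the sketch below, which reuses the data_loader from the previous sketch. How the executor object and the handler mapping are obtained, and the exact signature of register_handler, are assumptions; only register_handler and execute come from the description above. See the Megatron-LM example for a complete integration.

# Rough sketch of the executor-side training loop. The executor construction
# and handler mapping are framework-specific placeholders; the register_handler
# signature shown here is an assumption.
executor = ...  # obtain/construct the pipeline executor (framework-specific)

# Register an implementation for each pipeline instruction type.
for instruction_type, handler in my_instruction_handlers.items():
    executor.register_handler(instruction_type, handler)

# Each iteration: run the instructions described by that iteration's plan.
for micro_batches, execution_plan in data_loader:
    executor.execute(execution_plan)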

Environment Variables

Besides the above two interfaces, this project can also be configured through the following environment variables (a short example follows the list):

  • DYNAPIPE_KV_HOST: The host IP of the Redis key-value store server. Defaults to 'localhost' (must be set for multi-node training).
  • DYNAPIPE_KV_PORT: The port of the Redis key-value store server. Defaults to 29500.
  • DYNAPIPE_DEBUG: Logging level. Defaults to 'INFO'. Set to 'DEBUG' for more detailed logging.
  • DYNAPIPE_LOGGING_DEBUG_DIR: The directory where all generated logs are stored.
  • DYNAPIPE_DEBUG_DUMP_EP_STATS: If set to true, dumps the generated execution plans, seen sequence lengths, shapes of the generated micro-batches, estimated memory, and simulated traces for each iteration during training. Used for debugging and for collecting statistics during our experiments.
  • DYNAPIPE_DEBUG_DUMP_EP_PREFIX: The directory for dumping the above artifacts.
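
For example, a multi-node run might be configured as below. The host IP and directory are placeholders; these variables are normally exported in the launch script on every node, and if set from Python this must happen before DynaPipe reads them.

# Example configuration. Values are placeholders; in practice these variables
# are usually exported by the launch script on every node before training starts.
import os

os.environ["DYNAPIPE_KV_HOST"] = "10.0.0.1"          # IP of the node hosting the Redis server
os.environ["DYNAPIPE_KV_PORT"] = "29500"
os.environ["DYNAPIPE_DEBUG"] = "DEBUG"               # more verbose logging
os.environ["DYNAPIPE_LOGGING_DEBUG_DIR"] = "./dynapipe_logs"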

Code Structure

├── dynapipe
│   : main source folder
│   ├── data_opt
│   │   : code for micro-batch splitting and cost models
│   ├── memory_opt
│   │   : contains the modified CUDA caching memory allocator
│   │     from PyTorch
│   ├── pipe
│   │   : contains implementation of pipeline instructions,
│   │     executor, and the distributed instruction store
│   ├── schedule_opt
│   │   : code for computing pipeline schedule
│   └── utils
│       : other util codes like logger
├── scripts
│   : utility scripts for various purposes 
├── tests
│   : unit tests of different modules

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
