Distributed training experience

This is about distributed training with TensorFlow.

See also the Distributed Computing Overview page for a more general overview.

This could use distributed PyTorch, distributed TensorFlow (returnn/tf/distributed.py in RETURNN, issue #296), or Horovod (see the RETURNN documentation on Horovod), or a mixture of these. All of these technologies allow a wide range of possible distributed strategies (e.g. synchronous training, asynchronous training with parameter servers, asynchronous training with frequent syncs, ...).
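As a rough illustration of the synchronous data-parallel case, here is a minimal, generic Horovod + TensorFlow 2 sketch. It is not RETURNN-specific; the model, data, and hyperparameters are placeholders. Such a script is then typically launched with horovodrun or mpirun, one process per GPU.

```python
# Minimal generic sketch of synchronous data-parallel training with Horovod + TF2.
# Not RETURNN-specific; model, data and hyperparameters are placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to one GPU on its node.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.optimizers.SGD(0.01 * hvd.size())  # common heuristic: scale LR by world size


@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        logits = model(x)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y, logits, from_logits=True))
    # All-reduce (average) the gradients across all workers -> synchronous training.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start all workers from identical parameters and optimizer state.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss


for step in range(100):
    x = tf.random.normal([32, 20])
    y = tf.random.uniform([32], maxval=10, dtype=tf.int32)
    loss = train_step(x, y, first_batch=(step == 0))
    if hvd.rank() == 0 and step % 10 == 0:
        print("step", step, "loss", float(loss))
```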

This could use the new TF dataset pipeline (returnn/tf/data_pipeline.py in RETURNN, issue #292) or the old data pipeline.
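Independently of which pipeline is used, a per-worker input pipeline typically has to shard the data across workers. A minimal tf.data sketch of that idea (the file pattern and TFRecord parsing are placeholders, not the actual RETURNN pipeline):

```python
# Sketch of per-worker data sharding with tf.data (placeholder files and parsing).
import tensorflow as tf


def make_dataset(file_pattern, num_workers, worker_index, batch_size=32):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Each worker only reads every num_workers-th file -> disjoint shards per worker.
    files = files.shard(num_shards=num_workers, index=worker_index)
    ds = files.interleave(
        tf.data.TFRecordDataset, cycle_length=4, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds
```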

This might also require extending some of the existing implementations (all discussion about extending the code should happen in the corresponding GitHub issues, or on Slack).

We care about several cluster settings with different kinds of hardware (a multi-node configuration sketch follows the list):

  • single-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • single-node multi-GPU (cluster GPUs, fast interconnect)
  • multi-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • multi-node multi-GPU (cluster GPUs, fast interconnect)
  • AWS settings
  • GCP settings (GPU, or also TPU)
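For the multi-node settings with plain distributed TensorFlow, each process is typically configured via the TF_CONFIG environment variable. A minimal sketch with recent TF 2.x (hostnames, ports, and the model are placeholders; in practice TF_CONFIG would be set by the job scheduler or launch script, not hard-coded):

```python
# Sketch: multi-worker synchronous training with tf.distribute (TF2).
# Hostnames/ports are placeholders; TF_CONFIG is normally set by the cluster launcher.
import json
import os

import tensorflow as tf

os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["node1.example.com:12345", "node2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # the index differs per process
}))

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored and kept in sync via all-reduce
    # across all workers (and across the GPUs within each worker).
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```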