Distributed training experience

This is about distributed training with TensorFlow.

See also the Distributed Computing Overview page for a more general overview.

This could use distributed PyTorch, distributed TensorFlow (returnn/tf/distributed.py in RETURNN, issue #296), or Horovod (see the RETURNN documentation on Horovod), or a mixture of these. All of these technologies allow a wide range of possible distributed strategies (e.g. synchronous training, asynchronous training with parameter servers, asynchronous training with frequent syncs, ...).
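As a rough illustration of the synchronous data-parallel case, here is a minimal, generic Horovod + TensorFlow 2 sketch. It is not RETURNN-specific; the model, data, and hyperparameters are placeholders. Such a script is then typically launched with horovodrun or mpirun, one process per GPU.

```python
# Minimal generic sketch of synchronous data-parallel training with Horovod + TF2.
# Not RETURNN-specific; model, data and hyperparameters are placeholders.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to one GPU on its node.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.optimizers.SGD(0.01 * hvd.size())  # common heuristic: scale LR by world size


@tf.function
def train_step(x, y, first_batch):
    with tf.GradientTape() as tape:
        logits = model(x)
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y, logits, from_logits=True))
    # All-reduce (average) the gradients across all workers -> synchronous training.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    if first_batch:
        # Start all workers from identical parameters and optimizer state.
        hvd.broadcast_variables(model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)
    return loss


for step in range(100):
    x = tf.random.normal([32, 20])
    y = tf.random.uniform([32], maxval=10, dtype=tf.int32)
    loss = train_step(x, y, first_batch=(step == 0))
    if hvd.rank() == 0 and step % 10 == 0:
        print("step", step, "loss", float(loss))
```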

This could use the new TF dataset pipeline (returnn/tf/data_pipeline.py in RETURNN, issue #292) or the old data pipeline.
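Independently of which pipeline is used, a per-worker input pipeline typically has to shard the data across workers. A minimal tf.data sketch of that idea (the file pattern and TFRecord parsing are placeholders, not the actual RETURNN pipeline):

```python
# Sketch of per-worker data sharding with tf.data (placeholder files and parsing).
import tensorflow as tf


def make_dataset(file_pattern, num_workers, worker_index, batch_size=32):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=False)
    # Each worker only reads every num_workers-th file -> disjoint shards per worker.
    files = files.shard(num_shards=num_workers, index=worker_index)
    ds = files.interleave(
        tf.data.TFRecordDataset, cycle_length=4, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    return ds
```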

This might also require extending some of the existing implementations (all discussion about extending the code should happen in the corresponding GitHub issues, or on Slack).

We care about several cluster settings with different kinds of hardware (a multi-node configuration sketch follows the list):

  • single-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • single-node multi-GPU (cluster GPUs, fast interconnect)
  • multi-node multi-GPU (consumer GPU cards, just TCP/MPI data transfer, slow NFS)
  • multi-node multi-GPU (cluster GPUs, fast interconnect)
  • AWS settings
  • GCP settings (GPU, or also TPU)
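For the multi-node settings with plain distributed TensorFlow, each process is typically configured via the TF_CONFIG environment variable. A minimal sketch with recent TF 2.x (hostnames, ports, and the model are placeholders; in practice TF_CONFIG would be set by the job scheduler or launch script, not hard-coded):

```python
# Sketch: multi-worker synchronous training with tf.distribute (TF2).
# Hostnames/ports are placeholders; TF_CONFIG is normally set by the cluster launcher.
import json
import os

import tensorflow as tf

os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {"worker": ["node1.example.com:12345", "node2.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # the index differs per process
}))

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored and kept in sync via all-reduce
    # across all workers (and across the GPUs within each worker).
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```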