
Rethink/Refactor Horovod Testing #11975

Closed
krshrimali opened this issue Feb 18, 2022 · 2 comments · Fixed by #16141

krshrimali (Contributor) commented Feb 18, 2022

Proposed refactor

The tests written for the Horovod strategy might be outdated, as most of them were written ~2 years ago.

  • Many tests only check that the Horovod run finished without errors; the behavior the test was written for may not actually be verified.
  • accelerator="auto" can be used wherever possible to avoid maintaining separate tests for CPU and GPU devices.
  • Some tests don't need arguments like default_root_dir and weights_save_path; they should pass only the argument relevant to that test (like gradient_clip_algorithm). See the sketch after this list.
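
As a rough sketch of the second and third points (a sketch only, not a definitive implementation; BoringModel and its import path are assumptions based on the repo's existing test helpers):

```python
import pytest

from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # assumed helper path


@pytest.mark.parametrize("clip_algorithm", ["norm", "value"])
def test_horovod_grad_clip(clip_algorithm):
    # accelerator="auto" picks CPU or GPU depending on availability,
    # so one parametrized test replaces the separate cpu/gpu variants.
    trainer = Trainer(
        strategy="horovod",
        accelerator="auto",
        devices=1,
        fast_dev_run=True,
        # pass only the arguments this test actually exercises:
        gradient_clip_val=1e-4,
        gradient_clip_algorithm=clip_algorithm,
    )
    trainer.fit(BoringModel())
```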

Motivation

While working on #11911, @carmocca explained how these tests could be refactored; creating an issue to rethink the strategy for testing Horovod seemed like a good idea.

Pitch

Comments and discussion are welcome on this one.

An example:

  • test_horovod_cpu_clip_grad_by_value only tests that the Horovod run finished without errors; it doesn't check that gradient_clip_val was actually applied. We could avoid launching the process and instead verify directly that gradient_clip_val served its purpose (see the sketch below).
  • Minor changes were made to the tests here as well: https://github.com/PyTorchLightning/pytorch-lightning/pull/11911/files.
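
One possible way to check this without a Horovod run (again a sketch; it assumes the configure_gradient_clipping/clip_gradients hooks available since PL 1.5 and the same assumed BoringModel helper as above):

```python
from pytorch_lightning import Trainer
from tests.helpers.boring_model import BoringModel  # assumed helper path


class ClipRecordingModel(BoringModel):
    """Records the clipping arguments the Trainer passes to the hook."""

    def configure_gradient_clipping(
        self, optimizer, optimizer_idx, gradient_clip_val=None, gradient_clip_algorithm=None
    ):
        self.seen = (gradient_clip_val, gradient_clip_algorithm)
        # delegate to the default clipping behavior
        self.clip_gradients(
            optimizer,
            gradient_clip_val=gradient_clip_val,
            gradient_clip_algorithm=gradient_clip_algorithm,
        )


def test_gradient_clip_by_value_is_applied():
    model = ClipRecordingModel()
    trainer = Trainer(fast_dev_run=True, gradient_clip_val=1e-4, gradient_clip_algorithm="value")
    trainer.fit(model)
    # assert the Trainer arguments actually reached the clipping logic
    # (gradient_clip_algorithm arrives as a str-based enum, so == "value" holds)
    assert model.seen == (1e-4, "value")
```

The same pattern would let the Horovod-specific tests assert the behavior under test rather than mere completion.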


cc @carmocca @ananthsub @justusschock @awaelchli @rohitgr7 @Borda @akihironitta @kaushikb11

carmocca (Member) commented Feb 22, 2022

carmocca (Member) commented Apr 6, 2022

Related to this is the idea of upstreaming the Horovod strategy, which would mean removing all of these tests from our CI.

@carmocca carmocca added this to the future milestone Apr 6, 2022
@carmocca carmocca assigned Borda and unassigned Borda Apr 6, 2022
@Borda Borda self-assigned this Apr 6, 2022
@carmocca carmocca removed this from the future milestone Dec 20, 2022