Fix finalization of ProcessSetTable and some test flakiness with PyTorch 1.10.1 #3351

Merged: 3 commits, merged on Jan 10, 2022

Commits on Jan 7, 2022

  1. Reset next_id_ properly when finalizing ProcessSetTable

    In debug mode with Gloo, an assertion would otherwise fail when
    Horovod is reinitialized.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Jan 7, 2022
    b7b687b
  2. TorchTests: Add a barrier before shutdown in tearDown

    In tests like `test_horovod_alltoall_equal_split_length_error`, rank 0 could
    finish the test function and trigger Horovod's shutdown before rank 1 had a
    chance to call `alltoall` (which, if the test worked as intended, would return
    quickly with an error). Rank 1 would then crash.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Jan 7, 2022
    8278df5
  3. TorchTests::test_broadcast_state: Add explicit names to broadcast operations
    
    I am not sure, but the hangs may have been caused by incorrectly associated
    auto-generated names such as `broadcast.noname.1114`.
    
    Signed-off-by: Max H. Gerlach <git@maxgerlach.de>
    maxhgerlach committed Jan 7, 2022
    abea4a0
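The first commit's fix (resetting `next_id_` when `ProcessSetTable` is finalized) can be illustrated with a small Python model. This is a hypothetical sketch, not Horovod's C++ implementation: the class name `ProcessSetTableSketch`, its methods, and the assumption that the debug assertion checks that a fresh table starts handing out IDs from 0 are all illustrative choices that mirror the described symptom (an assertion failing on reinitialization).

```python
from typing import List


class ProcessSetTableSketch:
    """Hypothetical model of an ID-allocating table (not Horovod's C++ class)."""

    def __init__(self) -> None:
        self.next_id_ = 0
        self.ids: List[int] = []

    def initialize(self) -> int:
        # Debug-only check (stripped under `python -O`, much like a C++ assert
        # in release builds): a freshly initialized table must start at ID 0.
        assert self.next_id_ == 0, f"stale next_id_ = {self.next_id_}"
        return self.register()  # assumed: the global set gets the first ID

    def register(self) -> int:
        new_id = self.next_id_
        self.ids.append(new_id)
        self.next_id_ += 1
        return new_id

    def finalize(self, reset_next_id: bool = True) -> None:
        self.ids.clear()
        if reset_next_id:
            # The fix: without this reset, a second initialize() trips the assertion.
            self.next_id_ = 0


table = ProcessSetTableSketch()
table.initialize()         # global set gets ID 0
table.register()           # another process set gets ID 1
table.finalize()           # shutdown
print(table.initialize())  # 0: reinitialization succeeds
```

Calling `finalize(reset_next_id=False)` reproduces the pre-fix behavior: the next `initialize()` raises `AssertionError` because `next_id_` still holds the value from the previous session.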
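The race described in the second commit (rank 0 shutting down before rank 1 reaches `alltoall`) can be modeled with two threads standing in for ranks 0 and 1. The `FakeRuntime` class, the `run_test` helper, and the choice of `ValueError` versus `RuntimeError` are hypothetical; the actual change adds a barrier in the tests' `tearDown` before Horovod is shut down, which this sketch replaces with a `threading.Barrier`.

```python
import threading
from typing import Dict


class FakeRuntime:
    """Hypothetical stand-in for a shared collective runtime (not Horovod's API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.is_shut_down = False

    def shutdown(self) -> None:
        with self._lock:
            self.is_shut_down = True

    def alltoall_bad_splits(self) -> None:
        # Intended behavior: fail fast with a clean, catchable error.
        # If the runtime is already gone, the caller "crashes" instead.
        with self._lock:
            if self.is_shut_down:
                raise RuntimeError("runtime already shut down")
        raise ValueError("splits do not sum to the tensor length")


def run_test(use_barrier: bool) -> Dict[int, str]:
    runtime = FakeRuntime()
    barrier = threading.Barrier(2)
    rank0_left = threading.Event()
    outcomes: Dict[int, str] = {}

    def rank(r: int) -> None:
        if r == 1:
            # Model a slow rank 1: it reaches the collective only after rank 0
            # has left the test, or after a timeout if rank 0 is held at the barrier.
            rank0_left.wait(timeout=0.2)
        try:
            runtime.alltoall_bad_splits()
        except ValueError:
            outcomes[r] = "expected error"
        except RuntimeError:
            outcomes[r] = "crash"
        # tearDown(): optionally synchronize before rank 0 shuts the runtime down.
        if use_barrier:
            barrier.wait()
        if r == 0:
            runtime.shutdown()
            rank0_left.set()

    threads = [threading.Thread(target=rank, args=(r,)) for r in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(sorted(outcomes.items()))


print(run_test(use_barrier=False))  # {0: 'expected error', 1: 'crash'}
print(run_test(use_barrier=True))   # {0: 'expected error', 1: 'expected error'}
```

With the barrier, rank 0 cannot shut down until rank 1 has also reached `tearDown`, so rank 1's collective always runs against a live runtime.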
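The hypothesis in the third commit can be illustrated with a toy naming scheme. Horovod matches collective operations across ranks by name, so if auto-generated names come from a per-process counter and one rank has issued an extra unnamed operation, the same name can refer to different tensors on different ranks, and some names may never find a partner. `NameGenerator` and `submitted` are hypothetical and only mimic the `broadcast.noname.N` pattern from the commit message; as the commit itself says, it is not certain this caused the hangs.

```python
import itertools
from typing import Dict, List, Optional


class NameGenerator:
    """Hypothetical model of per-process autogenerated op names (not Horovod's code)."""

    def __init__(self) -> None:
        self._counter = itertools.count()

    def name(self, explicit: Optional[str] = None) -> str:
        # Without an explicit name, fall back to a counter local to this process,
        # so names only line up if every rank issues unnamed ops in lockstep.
        if explicit is not None:
            return explicit
        return f"broadcast.noname.{next(self._counter)}"


def submitted(params: List[str], use_explicit_names: bool, warmup_ops: int = 0) -> Dict[str, str]:
    """Return a mapping from op name to the parameter broadcast under that name."""
    gen = NameGenerator()
    for _ in range(warmup_ops):  # earlier unnamed ops issued only on this rank
        gen.name()
    return {gen.name(f"broadcast.{p}" if use_explicit_names else None): p for p in params}


params = ["fc1.weight", "fc1.bias"]
rank0 = submitted(params, use_explicit_names=False, warmup_ops=1)  # one extra op on rank 0
rank1 = submitted(params, use_explicit_names=False)

# Pair operations by name, the way a name-based coordinator would.
for name in sorted(rank0.keys() | rank1.keys()):
    print(name, rank0.get(name), rank1.get(name))
# broadcast.noname.1 pairs fc1.weight with fc1.bias; noname.0 and noname.2 have no partner.
```

With `use_explicit_names=True`, both ranks submit `broadcast.fc1.weight` and `broadcast.fc1.bias` regardless of earlier unnamed ops, which is the effect of passing an explicit `name=` to each broadcast call.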