
Ray tests are flaky #3435

Open
EnricoMi opened this issue Mar 1, 2022 · 4 comments

EnricoMi commented Mar 1, 2022

Ray tests have proven to be flaky, especially with GPUs (Buildkite CI).

There are two sources of this flakiness:

  1. Some tests fetch ray.available_resources() at the beginning of the test and compare the dict against ray.available_resources() after the test: assert check_resources(original_resources). It looks like that dict has race conditions: https://buildkite.com/horovod/horovod/builds/7306#cc0600a5-13ed-479d-ba1b-3bb0f6847992/332-436
       AssertionError: assert {'CPU': 4.0, ...147712.0, ...} == {'object_stor... 9999999984.0}
         Differing items:
         {'object_store_memory': 10000000000.0} != {'object_store_memory': 9999999984.0}
         Left contains 5 more items:
         {'CPU': 4.0,
          'GPU': 4.0,
          'accelerator_type:T4': 1.0,
          'memory': 189132147712.0,...
  2. Some tests see more GPUs than there should be:
       assert len(all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")) == 4
       assert 8 == 4
         +8
         -4

First, the tests should work with any number of GPUs (at least the minimum of 4 required GPUs). Second, the tests run on Buildkite machines with only 4 GPUs, with only one agent per machine, so there cannot be more than 4 GPUs visible to those tests. It looks like the RayExecutor provides that environment variable to the workers and somehow starts 8 workers rather than 4.
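A minimal sketch of what a more tolerant assertion could look like, assuming at least the 4 required GPUs are present; the names here (e.g. MIN_GPUS) are illustrative and not taken from the test suite:

    # hypothetical check: accept any GPU count >= 4, but fail on duplicated ids
    MIN_GPUS = 4
    devices = all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")
    assert len(devices) >= MIN_GPUS, devices
    # a device id listed twice would indicate the broken assignment seen above
    assert len(set(devices)) == len(devices), devices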

EnricoMi added the bug label Mar 1, 2022

EnricoMi commented Mar 1, 2022

@richardliaw @tgaddair the check_resources method seems to be very flaky (see 1. above), so I am going to remove it entirely in #3430: https://github.com/horovod/horovod/pull/3430/files#diff-142c833c54b6f513791b91a64842d417cb4025f79afcd0eca791eefdd2d2847fL71-L79

If you feel like this test is important, we need to find a different, more stable approach for the assertion.
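One possible direction, sketched under the assumption that the mismatch is just transient accounting noise in ray.available_resources() (e.g. the object_store_memory counter): poll until the resources settle instead of comparing a single snapshot. wait_for_resources below is a hypothetical helper, not existing test code:

    import time
    import ray

    # hypothetical helper: retry the comparison instead of asserting on one snapshot
    def wait_for_resources(original, timeout_s=30, ignore=("object_store_memory",)):
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            current = ray.available_resources()
            # compare everything except counters known to fluctuate by a few bytes
            if all(current.get(key) == value
                   for key, value in original.items() if key not in ignore):
                return True
            time.sleep(0.5)
        return False

    # usage in a test tear-down:
    # assert wait_for_resources(original_resources)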


EnricoMi commented Mar 1, 2022

Looks like RayExecutor produces a broken CUDA_VISIBLE_DEVICES: https://buildkite.com/horovod/horovod/builds/7308#461d92d2-110c-4539-ab04-703f49478c52/231-323

>       assert len(all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")) == 4, all_envs[0]["CUDA_VISIBLE_DEVICES"]
E       AssertionError: 0,3,1,2,1,2,0,3
E       assert 8 == 4
E         +8
E         -4


EnricoMi commented Mar 1, 2022

@ashahab @amogkam @richardliaw any idea why this:

    import os
    import ray
    from horovod.ray import RayExecutor

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    address_info = ray.init(num_cpus=4, num_gpus=4)
    setting = RayExecutor.create_settings(timeout_s=30)
    hjob = RayExecutor(
        setting, num_hosts=1, num_workers_per_host=4, use_gpu=True)
    hjob.start()
    all_envs = hjob.execute(lambda _: os.environ.copy())

sometimes produces all_envs[0]["CUDA_VISIBLE_DEVICES"] == '0,3,1,2,1,2,0,3'?
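For what it's worth, '0,3,1,2,1,2,0,3' looks like two 4-GPU assignments concatenated. A hypothetical debugging aid (not part of the test) that would show whether every worker sees the same duplicated string or each worker gets a different one:

    # hypothetical debugging aid: dump the variable as seen by each worker
    for rank, env in enumerate(all_envs):
        print(rank, env.get("CUDA_VISIBLE_DEVICES"))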


EnricoMi commented Mar 5, 2022

@ashahab can you take a look please?
