
Ray tests are flaky #3435

Open
EnricoMi opened this issue Mar 1, 2022 · 4 comments

EnricoMi commented Mar 1, 2022

Ray tests have proven to be flaky, especially with GPUs (Buildkite CI).

There are two sources of this flakiness:

  1. Some tests fetch ray.available_resources() at the beginning of the test and compare the dict against ray.available_resources() after the test: assert check_resources(original_resources). It looks like that dict has race conditions: https://buildkite.com/horovod/horovod/builds/7306#cc0600a5-13ed-479d-ba1b-3bb0f6847992/332-436
       AssertionError: assert {'CPU': 4.0, ...147712.0, ...} == {'object_stor... 9999999984.0}
         Differing items:
         {'object_store_memory': 10000000000.0} != {'object_store_memory': 9999999984.0}
         Left contains 5 more items:
         {'CPU': 4.0,
          'GPU': 4.0,
          'accelerator_type:T4': 1.0,
          'memory': 189132147712.0,...
  2. Some tests see more GPUs than there should be:
       assert len(all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")) == 4
       assert 8 == 4
         +8
         -4

First, the tests should work with any number of GPUs (at least the minimum of 4 required GPUs). Second, the tests run on Buildkite machines with only 4 GPUs, with only one agent per machine, so there cannot be more than 4 GPUs visible to those tests. It looks like the RayExecutor provides that environment variable to the workers and somehow starts 8 workers rather than 4.
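A minimal sketch of what a more tolerant assertion could look like, assuming at least the 4 required GPUs are present; the names here (e.g. MIN_GPUS) are illustrative and not taken from the test suite:

    # hypothetical check: accept any GPU count >= 4, but fail on duplicated ids
    MIN_GPUS = 4
    devices = all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")
    assert len(devices) >= MIN_GPUS, devices
    # a device id listed twice would indicate the broken assignment seen above
    assert len(set(devices)) == len(devices), devices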

EnricoMi added the bug label Mar 1, 2022

EnricoMi commented Mar 1, 2022

@richardliaw @tgaddair the check_resources method seems to be very flaky (see 1. above), so I am going to remove it entirely in #3430: https://github.com/horovod/horovod/pull/3430/files#diff-142c833c54b6f513791b91a64842d417cb4025f79afcd0eca791eefdd2d2847fL71-L79

If you feel like this test is important, we need to find a different, more stable approach for the assertion.
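One possible direction, sketched under the assumption that the mismatch is just transient accounting noise in ray.available_resources() (e.g. the object_store_memory counter): poll until the resources settle instead of comparing a single snapshot. wait_for_resources below is a hypothetical helper, not existing test code:

    import time
    import ray

    # hypothetical helper: retry the comparison instead of asserting on one snapshot
    def wait_for_resources(original, timeout_s=30, ignore=("object_store_memory",)):
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            current = ray.available_resources()
            # compare everything except counters known to fluctuate by a few bytes
            if all(current.get(key) == value
                   for key, value in original.items() if key not in ignore):
                return True
            time.sleep(0.5)
        return False

    # usage in a test tear-down:
    # assert wait_for_resources(original_resources)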


EnricoMi commented Mar 1, 2022

Looks like RayExecutor produces a broken CUDA_VISIBLE_DEVICES: https://buildkite.com/horovod/horovod/builds/7308#461d92d2-110c-4539-ab04-703f49478c52/231-323

>       assert len(all_envs[0]["CUDA_VISIBLE_DEVICES"].split(",")) == 4, all_envs[0]["CUDA_VISIBLE_DEVICES"]
E       AssertionError: 0,3,1,2,1,2,0,3
E       assert 8 == 4
E         +8
E         -4


EnricoMi commented Mar 1, 2022

@ashahab @amogkam @richardliaw any idea why this:

    import os
    import ray
    from horovod.ray import RayExecutor

    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
    address_info = ray.init(num_cpus=4, num_gpus=4)
    setting = RayExecutor.create_settings(timeout_s=30)
    hjob = RayExecutor(
        setting, num_hosts=1, num_workers_per_host=4, use_gpu=True)
    hjob.start()
    all_envs = hjob.execute(lambda _: os.environ.copy())

sometimes produces all_envs[0]["CUDA_VISIBLE_DEVICES"] == '0,3,1,2,1,2,0,3'?
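For what it's worth, '0,3,1,2,1,2,0,3' looks like two 4-GPU assignments concatenated. A hypothetical debugging aid (not part of the test) that would show whether every worker sees the same duplicated string or each worker gets a different one:

    # hypothetical debugging aid: dump the variable as seen by each worker
    for rank, env in enumerate(all_envs):
        print(rank, env.get("CUDA_VISIBLE_DEVICES"))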


EnricoMi commented Mar 5, 2022

@ashahab can you take a look please?
