Ray tests are flaky #3435
Comments
@richardliaw @tgaddair If you feel like this test is important, we need to find a different, more stable approach for the assertion.
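One more stable approach (a sketch only, not Horovod's actual test code; the `wait_for_condition` helper below is hypothetical) is to poll the resource dict with a timeout instead of asserting on a single snapshot, since Ray releases worker resources asynchronously after a job shuts down:

```python
import time


def wait_for_condition(predicate, timeout_s=30.0, interval_s=0.5):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True on success, False on timeout. This avoids asserting on a
    single snapshot of an eventually-consistent value such as the dict
    returned by ray.available_resources().
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    # One final check at the deadline.
    return predicate()


# In a test this could replace the single-shot assertion, e.g. (assuming
# `original_resources` was captured before the test started):
#   assert wait_for_condition(
#       lambda: ray.available_resources() == original_resources)
```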
Looks like
@ashahab @amogkam @richardliaw any idea why this:

```python
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
address_info = ray.init(num_cpus=4, num_gpus=4)
setting = RayExecutor.create_settings(timeout_s=30)
hjob = RayExecutor(
    setting, num_hosts=1, num_workers_per_host=4, use_gpu=True)
hjob.start()
all_envs = hjob.execute(lambda _: os.environ.copy())
```

sometimes may produce
@ashahab can you take a look please?
Ray tests have shown to be flaky, especially with GPU (Buildkite CI). There are two places that cause this flakiness:

1. Some tests capture `ray.available_resources()` at the beginning of the test and compare the dict against `ray.available_resources()` after the test: `assert check_resources(original_resources)`. It looks like that dict has race conditions: https://buildkite.com/horovod/horovod/builds/7306#cc0600a5-13ed-479d-ba1b-3bb0f6847992/332-436

2. First, the tests should work with any number of GPUs (larger than the minimum required 4 GPUs). Second, the tests run on a machine with only 4 GPUs on Buildkite, with only one agent per machine, so there cannot be more than 4 GPUs visible to those tests. It looks like the `RayExecutor` provides that environment variable to the workers and somehow starts 8 workers rather than 4.