Refine Bf16 test for deepspeed #17734
def is_torch_bf16_available():
    return is_torch_bf16_cpu_available() or is_torch_bf16_gpu_available()
Now that you have split it into two specific components, I think this function is ambiguous: what does it actually mean from the usage point of view?
Say a user has a CPU that supports bf16 but is actually planning to use a GPU, which may not support bf16. This will return True, and their code will then either fail or run really slowly.
I know you didn't add this in this PR; I missed that the expansion to add CPU checks in the previous PR created this ambiguity in the first place.
What do you think?
The original is_torch_bf16_available was only doing GPU checks, so perhaps we deprecate it and alias it to is_torch_bf16_gpu_available.
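To make the ambiguity concrete, here is a minimal sketch that uses stubbed checks instead of real hardware probes. The three function names and the hard-coded return values are assumptions chosen for illustration only; they are not the library's implementation.

```python
def bf16_cpu_available() -> bool:
    # Hypothetical stub: pretend this machine's CPU supports bf16.
    return True


def bf16_gpu_available() -> bool:
    # Hypothetical stub: pretend this machine's GPU does not support bf16.
    return False


def bf16_available() -> bool:
    # Same shape as the combined check under review: True if either device qualifies.
    return bf16_cpu_available() or bf16_gpu_available()


combined = bf16_available()
gpu_only = bf16_gpu_available()
print(f"combined check: {combined}, gpu-only check: {gpu_only}")
```

A caller that intends to train on the GPU and only consults the combined check would proceed with bf16 even though the GPU-specific check says no.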
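import warnings

def is_torch_bf16_available():
    # Deprecated alias: keep the old name but forward to the GPU-only check.
    warnings.warn("is_torch_bf16_available is deprecated; use is_torch_bf16_gpu_available", FutureWarning)
    return is_torch_bf16_gpu_available()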
It is used in the Trainer for the Intel CPU integration, so your suggestion would break that new integration.
That's correct. We would rename it in the Trainer as well, since the call there is incorrect: it should be doing the CPU check introduced in this PR.
Can do in the next PR.
Great find. We have to keep those checks separate depending on the target device.
is_torch_bf16_available is currently ambiguous, but if you're in a rush, let's merge this and then fix it in the next PR.
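One way to keep the checks separate per target device is to choose the check from the device the caller actually intends to use. The sketch below is only an illustration: `bf16_available_for`, its parameters, and the accepted device strings are hypothetical, and the two checks are passed in as callables so the example runs without torch.

```python
from typing import Callable


def bf16_available_for(
    device: str,
    cpu_check: Callable[[], bool],
    gpu_check: Callable[[], bool],
) -> bool:
    """Return bf16 support for the device the caller plans to run on."""
    if device == "cpu":
        return cpu_check()
    if device in ("cuda", "gpu"):
        return gpu_check()
    # Unknown device strings are rejected instead of silently returning False.
    raise ValueError(f"unsupported device: {device!r}")


# Hypothetical machine: the CPU supports bf16, the GPU does not.
on_cpu = bf16_available_for("cpu", cpu_check=lambda: True, gpu_check=lambda: False)
on_gpu = bf16_available_for("cuda", cpu_check=lambda: True, gpu_check=lambda: False)
print(f"cpu: {on_cpu}, cuda: {on_gpu}")
```

In real code the two callables would be the CPU and GPU checks from this PR; the point is only that the answer depends on the target device rather than on an `or` of both.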
* Refine BF16 check in CPU/GPU
* Fixes
* Renames
What does this PR do?
This PR splits the is_torch_bf16_available test into two separate ones, for GPU and CPU, as the DeepSpeed tests require GPU bfloat16.