reenable apex backward / phantom grad tests without wrecking CI #392

Open
t-vi opened this issue May 10, 2024 · 3 comments

@t-vi
Collaborator

t-vi commented May 10, 2024

I'm disabling the apex tests because they use an excessive amount of memory.
I'm not sure whether the cause is the tests themselves or a bug in the executor.
I'd be happy to see them re-enabled, either with lower memory consumption or run without parallelism.

test cuda memory use test_apex_cross_entropy_backward[cuda-float16] 4 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-bfloat16] 5 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-float32] 6 memory 2.872036933898926
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float16] 7 memory 8.822380542755127
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-bfloat16] 8 memory 9.014111042022705
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float32] 9 memory 9.401249408721924

cc @Borda @crcrpar

@IvanYashchuk
Collaborator

What do these numbers mean? What are the units? How much GPU memory is acceptable to use in a test?

@t-vi
Collaborator Author

t-vi commented May 11, 2024

The numbers are GPU memory in GB. We can run tests that need 8GB, but we should not run them in the parallel setup.
For comparison, in #394 I listed all tests (of 7000) that need more than 0.6GB of GPU memory.

It would be cool if someone (else) could look into whether softmax testing really needs that much memory. If it does, we can move the apex / triton cross-entropy tests to the "network tests" section of the GPU run (or do this via tagging).

@mruberry
Collaborator

triage review:

  • these tests probably don't need to use so much memory, and the amount can be reduced
  • if the memory is needed, we can mark them to be executed serially
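
If the serial route is taken, one possible way to do the tagging is a custom pytest marker applied at collection time. The sketch below is an assumption about how this could look in a `conftest.py`, not how the repository does it: the marker name `serial` and the prefix list are hypothetical, while `config.addinivalue_line` and `item.add_marker` are standard pytest hook APIs.

```python
# conftest.py (sketch): tag memory-hungry tests so they can run outside the parallel job.

# Hypothetical marker name and name prefixes; adjust to the real test IDs.
SERIAL_MARKER = "serial"
HEAVY_TEST_PREFIXES = (
    "test_apex_cross_entropy_backward",
    "test_apex_cross_entropy_phantom_grad",
)


def pytest_configure(config):
    # Register the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers",
        f"{SERIAL_MARKER}: run without parallelism (high GPU memory use)",
    )


def pytest_collection_modifyitems(config, items):
    # Attach the marker to every collected test whose name starts with a heavy prefix.
    for item in items:
        if item.name.startswith(HEAVY_TEST_PREFIXES):
            item.add_marker(SERIAL_MARKER)
```

With something like this in place, the parallel job could deselect the tagged tests with `pytest -m "not serial"` (plus whatever parallelism option the CI already uses, for example `-n` from pytest-xdist), and a separate step could run `pytest -m serial` on its own.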
