I'm disabling the apex tests because they use an excessive amount of memory.
I'm not sure whether the cause is the tests themselves or a bug in the executor.
I'd be happy to see them re-enabled, either with lower memory consumption or run without parallelism.
test cuda memory use test_apex_cross_entropy_backward[cuda-float16] 4 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-bfloat16] 5 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-float32] 6 memory 2.872036933898926
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float16] 7 memory 8.822380542755127
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-bfloat16] 8 memory 9.014111042022705
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float32] 9 memory 9.401249408721924
The numbers are GPU memory in GB. We can run tests needing 8GB, but we should not run them in the parallel setup.
For comparison, in #394 I listed all tests (out of 7000) that need more than 0.6GB of GPU memory.
It would be great if someone (else) could look into whether we really need that much memory for softmax testing. If so, we can move the apex / triton cross-entropy test calls to the "network tests" section of the GPU run (or do this via tagging).
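To help decide which tests to move or tag, here is a rough sketch that parses memory lines in the format pasted above and splits the tests into those that could stay in the parallel run and those that should run serially. The function name `split_by_memory` and the 4.0 GB cutoff are assumptions for illustration only, not anything that exists in the repository or CI config.

```python
import re

# Matches lines of the form pasted above, e.g.
# "test cuda memory use test_name[cuda-float16] 4 memory 1.72"
LINE_RE = re.compile(
    r"^test cuda memory use (?P<name>\S+) (?P<index>\d+) memory (?P<gb>[0-9.]+)$"
)


def split_by_memory(log_lines, limit_gb):
    """Split tests into (parallel_ok, serial_only) by peak GPU memory in GB."""
    parallel_ok, serial_only = [], []
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a memory line
        entry = (match.group("name"), float(match.group("gb")))
        if entry[1] > limit_gb:
            serial_only.append(entry)
        else:
            parallel_ok.append(entry)
    return parallel_ok, serial_only


LOG = """\
test cuda memory use test_apex_cross_entropy_backward[cuda-float16] 4 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-bfloat16] 5 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-float32] 6 memory 2.872036933898926
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float16] 7 memory 8.822380542755127
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-bfloat16] 8 memory 9.014111042022705
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float32] 9 memory 9.401249408721924
"""

# 4.0 GB is a hypothetical cutoff; the real limit depends on the GPU
# and on how many workers the parallel setup uses.
parallel_ok, serial_only = split_by_memory(LOG.splitlines(), limit_gb=4.0)

for name, gb in serial_only:
    print(f"run serially: {name} ({gb:.2f} GB)")
```

With that cutoff, only the `phantom_grad` variants land in the serial list, which would make them the candidates for the "network tests" section or for a tag.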
cc @Borda @crcrpar