reenable apex backward / phantom grad tests without wrecking CI #392

t-vi · 2024-05-10T08:22:34Z

I'm disabling the apex tests because they use an excessive amount of GPU memory.
I'm not sure whether the cause is the tests themselves or a bug in the executor.
I'd be happy to see them re-enabled, either with lower memory consumption or run without parallelism.

test cuda memory use test_apex_cross_entropy_backward[cuda-float16] 4 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-bfloat16] 5 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-float32] 6 memory 2.872036933898926
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float16] 7 memory 8.822380542755127
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-bfloat16] 8 memory 9.014111042022705
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float32] 9 memory 9.401249408721924

cc @Borda @crcrpar

IvanYashchuk · 2024-05-10T12:20:53Z

What do these numbers mean? What are the units? How much GPU memory is acceptable to use in a test?

t-vi · 2024-05-11T10:09:04Z

The numbers are GB of GPU memory. We can run tests that need 8 GB, but we should not run them in the parallel setup.
For comparison, in #394 I listed all tests (out of 7000) that need more than 0.6 GB of GPU memory.

It would be cool if someone (else) could look into whether we really need that much memory for softmax testing. If so, we can move the apex / triton cross-entropy tests to the "network tests" section of the GPU run (or do this via tagging).

mruberry · 2024-05-13T19:16:38Z

triage review:

- these tests probably don't need to use so much memory, and the amount can be reduced
- if the memory is needed, we can mark them to be executed serially
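
One way to picture the serial option is to split tests by a per-test memory limit, so that only the small ones stay in the parallel run. This is a sketch under stated assumptions: `split_by_memory` is a hypothetical helper, the 2.0 GB limit is an arbitrary illustration value rather than a number from this thread, and the memory figures are the ones from the issue description rounded to two decimals.

```python
def split_by_memory(usage_gb, parallel_limit_gb):
    """Split test ids into (parallel, serial) lists by a per-test memory limit.

    Tests whose recorded memory exceeds the limit go to the serial list.
    """
    parallel, serial = [], []
    for name in sorted(usage_gb):
        if usage_gb[name] > parallel_limit_gb:
            serial.append(name)
        else:
            parallel.append(name)
    return parallel, serial


# Memory use in GB, rounded from the log in the issue description.
usage_gb = {
    "test_apex_cross_entropy_backward[cuda-float16]": 1.72,
    "test_apex_cross_entropy_backward[cuda-bfloat16]": 1.72,
    "test_apex_cross_entropy_backward[cuda-float32]": 2.87,
    "test_apex_cross_entropy_phantom_grad[cuda-float16]": 8.82,
    "test_apex_cross_entropy_phantom_grad[cuda-bfloat16]": 9.01,
    "test_apex_cross_entropy_phantom_grad[cuda-float32]": 9.40,
}

# Hypothetical limit, chosen only for illustration.
parallel, serial = split_by_memory(usage_gb, parallel_limit_gb=2.0)
print("parallel:", parallel)
print("serial:", serial)
```

In a pytest-based suite, the serial group could then be tagged with a custom marker and selected with `-m` in a separate, non-parallel invocation. Whether that fits this CI setup is an assumption.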

t-vi added ci / tests apex memory use labels May 10, 2024

This was referenced May 10, 2024

skip apex backward / phantom grad #393

Merged

Investigate GPU memory intense CI tests #394

Open

Enable apex tests #221

Open

mruberry added triage review autograd and removed triage review labels May 13, 2024
