Investigate GPU memory intense CI tests #394

Open

t-vi opened this issue May 10, 2024 · 1 comment

@t-vi
Collaborator

t-vi commented May 10, 2024

We should look at how these are run. The numbers are GPU memory in GB, listing all tests that use more than 0.6 GB. (We could lower the threshold to 0.5 GB, but it does not matter much.)

Maybe we should enforce a GPU memory threshold for the parallel tests.

test cuda memory use test_apex_cross_entropy[cuda-float32] 1 memory 0.7668776512145996
test cuda memory use test_populate_grads_nanogpt_torch_cuda_float32 1065 memory 2.0184102058410645
test cuda memory use test_populate_grads_nanogpt_nvfuser_cuda_float32 1066 memory 2.0110630989074707
test cuda memory use test_nanogpt_complete_torch_cuda_float32 1572 memory 0.7405362129211426
test cuda memory use test_nanogpt_complete_nvfuser_cuda_float32 1573 memory 0.7601151466369629
test cuda memory use test_nanogpt_complete_autograd_torch_cuda_float32 1575 memory 1.6154489517211914
test cuda memory use test_nanogpt_complete_autograd_nvfuser_cuda_float32 1576 memory 1.6141061782836914
test cuda memory use test_nanogpt_complete_cudagraphs_torch_cuda_float32 1577 memory 1.05192232131958
test cuda memory use test_nanogpt_complete_cudagraphs_nvfuser_cuda_float32 1578 memory 1.0715012550354004
test cuda memory use test_triton_cross_entropy[cuda-float16] 7555 memory 0.8328347206115723
test cuda memory use test_triton_cross_entropy[cuda-bfloat16] 7556 memory 0.8328347206115723
test cuda memory use test_triton_cross_entropy[cuda-float32] 7557 memory 1.0246472358703613
test cuda memory use test_triton_cross_entropy[cuda-float64] 7558 memory 2.1767783164978027
test cuda memory use test_triton_cross_entropy_vs_torch_consistency[cuda-float32] 7561 memory 0.7137904167175293
test cuda memory use test_triton_cross_entropy_vs_torch_consistency[cuda-float64] 7562 memory 2.0997800827026367

I will probably look into erroring out on tests that use too much memory and are not run separately; the limit might end up at 0.6 GB.

If we find that the apex (#392) and triton cross entropy tests need operands of this size, we should move them to be executed separately, either using the current setup (as for test_networks) or a mechanism like #219.
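For context, one way such a mechanism could look (just a sketch; the marker name is made up here and may not match what #219 proposes) is a pytest marker that the parallel job deselects:

import pytest

# Hypothetical marker for memory-hungry tests; it would need to be registered
# in the pytest configuration to avoid unknown-mark warnings.
memory_intensive = pytest.mark.memory_intensive


@memory_intensive
def test_placeholder_big_gpu_op():
    # Placeholder body; real tests in the suite would simply carry the marker.
    ...

The parallel CI job would then run with -m "not memory_intensive", and a separate serial job would run with -m memory_intensive.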

The tests below have already been disabled in #393, and I filed #392.

test cuda memory use test_apex_cross_entropy_backward[cuda-float16] 4 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-bfloat16] 5 memory 1.721745491027832
test cuda memory use test_apex_cross_entropy_backward[cuda-float32] 6 memory 2.872036933898926
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float16] 7 memory 8.822380542755127
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-bfloat16] 8 memory 9.014111042022705
test cuda memory use test_apex_cross_entropy_phantom_grad[cuda-float32] 9 memory 9.401249408721924

cc @Borda

@t-vi
Collaborator Author

t-vi commented May 11, 2024

To facilitate discussion, I generated these numbers by adding the following thunder/tests/conftest.py. If raising an error there makes a test fail, we could use a similar setup to enforce a GPU memory limit in the parallel tests.

import pytest
import torch

# Running index of the test within the session (matches the numbers in the listing above).
cnt = 0


@pytest.fixture(autouse=True)
def gpu_memory(request):
    global cnt
    cnt += 1
    yield
    if torch.cuda.is_available():
        # Report the peak GPU memory allocated during the test (in GB), then reset
        # the stats so the next test gets a clean measurement.
        print("\ntest cuda memory use", request.node.name, cnt, "memory", torch.cuda.max_memory_allocated() / 2**30)
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
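
For illustration, a rough sketch of an enforcing variant could look like this (the 0.6 GB limit and the use of pytest.fail at teardown are assumptions, not a settled design):

import pytest
import torch

# Assumed limit; 0.6 GB matches the threshold discussed above.
GPU_MEM_LIMIT_GB = 0.6


@pytest.fixture(autouse=True)
def enforce_gpu_memory_limit(request):
    yield
    if not torch.cuda.is_available():
        return
    used_gb = torch.cuda.max_memory_allocated() / 2**30
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    if used_gb > GPU_MEM_LIMIT_GB:
        # Failing after the yield surfaces the overshoot as a teardown error
        # attributed to the offending test.
        pytest.fail(
            f"{request.node.name} used {used_gb:.2f} GB of GPU memory, "
            f"exceeding the {GPU_MEM_LIMIT_GB} GB limit for parallel tests"
        )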
