OOM error in building ARM wheel #1812

tinglvv · 2024-05-07T04:25:42Z

As part of the process of adding the CUDA ARM nightly wheel, I am seeing an OOM error while building flash_attn when adding https://github.com/pytorch/builder/pull/1775/files to the nightly CI.

2024-04-26T02:20:01.5211732Z [6579/6896] Building CUDA object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5215252Z FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5250703Z /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_CUSPARSELT -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -DTORCH_ASSERT_NO_OPERATORS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/aten/src/THC -I/pytorch/aten/src/ATen/cuda -I/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/pytorch/build/caffe2/aten/src -I/pytorch/aten/src/ATen/.. -I/pytorch/build/nccl/include -I/pytorch/c10/cuda/../.. -I/pytorch/c10/.. 
-I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -isystem /pytorch/build/third_party/gloo -isystem /pytorch/cmake/../third_party/gloo -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/third_party/gemmlowp -isystem /pytorch/third_party/neon2sse -isystem /pytorch/third_party/XNNPACK/include -isystem /pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /pytorch/third_party/ideep/include -isystem /pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_50,code=sm_50 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-unused-function,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-maybe-uninitialized 
-Wno-deprecated-copy -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o.d -x cu -c /pytorch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim192_bf16_sm80.cu.o
2024-04-26T02:20:01.5283695Z /pytorch/aten/src/ATen/../../../third_party/cutlass/include/cute/layout.hpp(988): catastrophic error: out of memory

Error link - https://github.com/pytorch/pytorch/actions/runs/8840652730/job/24276381274?pr=124112.
Relevant PR for above error - pytorch/pytorch#124112.

I tried setting MAX_JOBS=4 (the default is 6). That avoids the OOM error, but the build takes more than 7 hours.
Link - https://github.com/pytorch/pytorch/actions/runs/8970425814/job/24633947792
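
One way to choose a MAX_JOBS value other than by trial and error is to derive it from the runner's memory. The sketch below is only an illustration: the helper names `pick_max_jobs` and `total_memory_gib` are hypothetical, and the 8 GiB per-job budget is an assumed placeholder, not a measured peak for the flash_attn kernels. It does not limit parallelism for flash_attn alone; it only caps the whole build.

```python
import os


def pick_max_jobs(total_mem_gib: float, per_job_gib: float, cpu_count: int) -> int:
    """Largest job count that fits in memory, capped by CPU count, never below 1."""
    if per_job_gib <= 0:
        raise ValueError("per_job_gib must be positive")
    by_memory = int(total_mem_gib // per_job_gib)
    return max(1, min(cpu_count, by_memory))


def total_memory_gib() -> float:
    """Physical memory of the machine in GiB (POSIX systems only)."""
    return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30


if __name__ == "__main__":
    # ASSUMPTION: 8 GiB per compile job is a placeholder; measure the real
    # peak of the heaviest nvcc invocation on the runner before relying on it.
    jobs = pick_max_jobs(
        total_memory_gib(),
        per_job_gib=8.0,
        cpu_count=os.cpu_count() or 1,
    )
    # Print a line that a CI step could eval before starting the build.
    print(f"export MAX_JOBS={jobs}")
```

With a per-job budget measured on the actual runner, this would give a starting point somewhere between the failing and the 7-hour configurations, which can then be adjusted.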

Is there a way to set MAX_JOBS only for the flash_attn part of the build?

cc @atalman @malfet @Aidyn-A @nWEIdia @ptrblck


tinglvv closed this as completed May 7, 2024

tinglvv reopened this May 7, 2024
