
[FEA] Add support for CUDA compressed memory. #1265

Open
veritas9872 opened this issue May 12, 2023 · 1 comment
Labels: ? - Needs Triage (need team to review and classify), feature request (new feature or request)

@veritas9872 commented May 12, 2023

Is your feature request related to a problem? Please describe.

I would like to request RMM support for CUDA compressed memory, a feature available on the A100 and H100 that compresses data in both DRAM and the L2 cache, increasing effective bandwidth.

Describe the solution you'd like

Unfortunately, there is no way to access compressed memory from the standard CUDA Runtime API; users must allocate it through the CUDA Driver API. See page 30 of the PDF below.
https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21819-optimizing-applications-for-nvidia-ampere-gpu-architecture.pdf
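
For illustration, here is a minimal sketch of what a driver-API compressed allocation looks like, following the virtual memory management workflow from that presentation (cuMemCreate with allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC). The device index and one-granule allocation size are arbitrary choices for the example, and error handling is reduced to a macro:

```cpp
#include <cuda.h>
#include <cstdio>

#define CHECK(call)                                                        \
  do {                                                                     \
    CUresult err = (call);                                                 \
    if (err != CUDA_SUCCESS) {                                             \
      std::printf("CUDA error %d at %s:%d\n", (int)err, __FILE__, __LINE__); \
      return 1;                                                            \
    }                                                                      \
  } while (0)

int main() {
  CHECK(cuInit(0));
  CUdevice dev;
  CHECK(cuDeviceGet(&dev, 0));
  CUcontext ctx;
  CHECK(cuCtxCreate(&ctx, 0, dev));

  // Generic compression is only supported on some GPUs (e.g. A100/H100).
  int supported = 0;
  CHECK(cuDeviceGetAttribute(
      &supported, CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, dev));
  if (!supported) { std::printf("compression not supported\n"); return 0; }

  // Request a generic-compressed physical allocation on this device.
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = dev;
  prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

  size_t granularity = 0;
  CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM));
  size_t size = granularity;  // one granule, just for the example

  CUmemGenericAllocationHandle handle;
  CHECK(cuMemCreate(&handle, size, &prop, 0));

  // Map the physical allocation into the address space and enable access.
  CUdeviceptr ptr;
  CHECK(cuMemAddressReserve(&ptr, size, 0, 0, 0));
  CHECK(cuMemMap(ptr, size, 0, handle, 0));
  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  CHECK(cuMemSetAccess(ptr, size, &access, 1));

  // ... use `ptr` in kernels ...

  CHECK(cuMemUnmap(ptr, size));
  CHECK(cuMemAddressFree(ptr, size));
  CHECK(cuMemRelease(handle));
  return 0;
}
```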

Describe alternatives you've considered

CUDA Python provides access to the CUDA Driver API, but I cannot figure out how to wrap driver-allocated memory in higher-level objects such as PyTorch tensors.

Additional context

Due to the memory bottleneck faced by large language models during inference, compressing data as it moves through DRAM/HBM and the L2 cache is one of the simplest ways to improve throughput. Moreover, with filtering techniques such as byte shuffling and bit shuffling, the compression ratio of even simple compression algorithms can be improved substantially.
Given the enormous interest in deploying LLMs in the real world, I believe this feature would be very helpful.

@jrhemstad (Contributor) commented May 12, 2023

Compressed memory is currently only available by using cuMemCreate directly. RMM doesn't have any allocator implementations that use cuMemCreate, and implementing one would be a significant amount of work.

However, we do have an implementation that uses cudaMemPoolCreate/cudaMallocAsync.

The cudaMemPoolProps struct passed to cudaMemPoolCreate does not currently have an option to enable compression, but it may be possible to add one. I will explore this option internally.
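
For context, here is a minimal sketch of the stream-ordered pool path that the runtime API provides today, using only documented calls (cudaMemPoolCreate, cudaMallocFromPoolAsync); the device index and 1 MiB allocation size are illustrative. The key point is that cudaMemPoolProps exposes allocType, handleTypes, and location, but no compression field:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int dev = 0;
  cudaSetDevice(dev);

  // Explicit pool creation as used by stream-ordered allocators today.
  // Note: no field here for requesting compressed memory.
  cudaMemPoolProps props = {};
  props.allocType = cudaMemAllocationTypePinned;
  props.handleTypes = cudaMemHandleTypeNone;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id = dev;

  cudaMemPool_t pool;
  if (cudaMemPoolCreate(&pool, &props) != cudaSuccess) {
    std::printf("pool creation failed\n");
    return 1;
  }

  // Stream-ordered allocation from the explicit pool.
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  void* ptr = nullptr;
  cudaMallocFromPoolAsync(&ptr, 1 << 20, pool, stream);

  // ... launch work on `stream` using `ptr` ...

  cudaFreeAsync(ptr, stream);
  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaMemPoolDestroy(pool);
  return 0;
}
```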
