
Implement multi-pass prefetch for memory efficiency #2566

Closed

Conversation

@levythu (Contributor) commented May 7, 2024

Summary:

Context

A memory snapshot shows significant memory usage during the prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), estimated at roughly 6x the input size.

Unfortunately, because these kernels run on a dedicated stream, that memory cannot be reused by any other stream without a performance penalty.

So we need to lower the peak prefetch memory usage as much as possible.

MultiPass Prefetch (MPP)

Multi-pass prefetch trades a small amount of extra running time for lower peak memory during prefetch: the intermediate memory usage of every function in the prefetch path is `O(N)`, so we reduce the number of prefetched indices (`N`) handled in each pass to cut the peak temporary usage. Subsequent passes recycle the memory used by the first pass, so they do not further increase the memory footprint.
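To illustrate the idea, here is a conceptual sketch (not the actual kernel code in this diff); `prefetch_one_pass` stands in for the existing single-pass prefetch path:

```python
from typing import Callable

import torch


def multipass_prefetch(
    linear_cache_indices: torch.Tensor,
    num_passes: int,
    prefetch_one_pass: Callable[[torch.Tensor], None],
) -> None:
    # Split the N prefetched indices into roughly equal chunks so each pass only
    # materializes O(N / num_passes) worth of temporaries. Later passes run after
    # the first and reuse the allocator blocks it freed, so peak memory stays at
    # the single-chunk level instead of O(N).
    for pass_indices in torch.tensor_split(linear_cache_indices, num_passes):
        prefetch_one_pass(pass_indices)  # existing single-pass prefetch logic
```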

Benefit

With this turned on, the peak memory usage drops from `6 * input_size` to `(6 / M) * input_size`, where `M` is the configured number of passes.

Overhead
Overall, the larger the configured `M`, the slower prefetch becomes, but the overhead is acceptable:

  • Efficiency regression: Prefetch takes longer because the cache lookup is repeated for duplicate indices. Previously, indices were deduplicated before lookup; now an index may be looked up multiple times if its duplicates fall into different passes.
    • The regression is overall insignificant, as the major cost is data movement between DDR and HBM. We still copy the data only once, even if indices are duplicated across different passes.
    • The regression is likely hidden from actual training performance, since prefetch happens on a separate stream. As long as it is not long enough to block the sparse backward pass, it is invisible.
  • Spamming the CUDA launch queue: CUDA allows at most 1024 pending kernel launches, and the CPU blocks if more are submitted. If each kernel is very small, we can easily spam the launch queue and badly hurt QPS. We mitigate this by enforcing a minimum number of elements per pass, as sketched below.
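For example, the minimum-size guard could look roughly like this (a sketch under assumed names; `MIN_PASS_SIZE` and the helper are illustrative, not necessarily the constants used in the patch):

```python
MIN_PASS_SIZE = 16 * 1024  # illustrative threshold, not the exact value in the patch


def effective_num_passes(total_indices: int, requested_passes: int) -> int:
    # Cap the number of passes so that each pass still processes at least
    # MIN_PASS_SIZE elements; otherwise many tiny kernels would pile up in
    # CUDA's ~1024-entry pending-launch queue and stall the CPU.
    max_passes_by_size = max(1, total_indices // MIN_PASS_SIZE)
    return max(1, min(requested_passes, max_passes_by_size))
```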

What's in the patch?

  1. Add a multi-pass prefetch config to the TBE interface. It defaults to None for full backward compatibility (see the usage sketch after this list).
  2. Modify `lru_find_uncached` to make it idempotent: if we try to lock the same id multiple times within a single timestep (but across multiple passes), the lock counter is incremented only once.
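A hedged usage sketch of item 1: the snippet below shows how the new option might be passed to a TBE module. The config class name, its `num_passes` field, the keyword argument, and the import locations are assumptions for illustration and should be checked against the actual API added in this diff.

```python
# All multipass-related names below are assumptions for illustration.
from fbgemm_gpu.split_table_batched_embeddings_ops_common import (
    EmbeddingLocation,
    MultipassPrefetchConfig,  # assumed to be introduced by this diff
)
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, location, compute_device)
        (1_000_000, 128, EmbeddingLocation.MANAGED_CACHING, ComputeDevice.CUDA),
    ],
    # Defaults to None, which keeps the existing single-pass prefetch behavior.
    multipass_prefetch_config=MultipassPrefetchConfig(num_passes=4),
)
```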

Differential Revision: D56908989

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D56908989


netlify bot commented May 7, 2024

Deploy Preview for pytorch-fbgemm-docs failed.

🔨 Latest commit: 3386341
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66469ed5af92290008b814cd

levythu added a commit to levythu/FBGEMM that referenced this pull request May 7, 2024

levythu added a commit to levythu/torchrec that referenced this pull request May 14, 2024
…r will correctly propagate its storage cost

Summary: After FBGEMM TBE supports multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff enables TorchRec to pass it all the way through via CacheParams, and the shard estimator will recognize the memory saving accordingly.

Differential Revision: D57055184
levythu added a commit to levythu/FBGEMM that referenced this pull request May 15, 2024

levythu added a commit to levythu/FBGEMM that referenced this pull request May 16, 2024

levythu added a commit to levythu/torchrec that referenced this pull request May 16, 2024

levythu added a commit to levythu/FBGEMM that referenced this pull request May 17, 2024

@facebook-github-bot (Contributor)

This pull request has been merged in 578ab67.

levythu added a commit to levythu/torchrec that referenced this pull request May 18, 2024
facebook-github-bot pushed a commit to pytorch/torchrec that referenced this pull request May 18, 2024
…r will correctly propagate its storage cost (#2000)

Summary:
Pull Request resolved: #2000

After FBGEMM TBE supports multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff enables TorchRec to pass it all the way through via CacheParams, and the shard estimator will recognize the memory saving accordingly.

Reviewed By: sarckk

Differential Revision: D57055184

fbshipit-source-id: 9fc2d05c30d7826f654976b1d85059fb3e9a1aae