
Implement multi-pass prefetch for memory efficiency #2566

Closed

Conversation

@levythu (Contributor) commented May 7, 2024

Summary:

Context

A memory snapshot shows significant memory usage during the prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), estimated at roughly 6x the input size.

Unfortunately, because these kernels run on a dedicated stream, that memory cannot be reused by any other stream without a performance penalty.

So we need to lower the peak prefetch memory usage as much as possible.

MultiPass Prefetch (MPP)

Multi-pass prefetch trades a small amount of extra running time for lower peak memory during prefetch: the intermediate memory usage of every function in the prefetch path is `O(N)`, so we reduce the number of prefetched indices (`N`) handled in each pass to cut the peak temporary usage. Subsequent passes recycle the memory used by the first pass, so they do not further increase the memory footprint.
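To illustrate the idea, here is a conceptual sketch (not the actual kernel code in this diff); `prefetch_one_pass` stands in for the existing single-pass prefetch path:

```python
from typing import Callable

import torch


def multipass_prefetch(
    linear_cache_indices: torch.Tensor,
    num_passes: int,
    prefetch_one_pass: Callable[[torch.Tensor], None],
) -> None:
    # Split the N prefetched indices into roughly equal chunks so each pass only
    # materializes O(N / num_passes) worth of temporaries. Later passes run after
    # the first and reuse the allocator blocks it freed, so peak memory stays at
    # the single-chunk level instead of O(N).
    for pass_indices in torch.tensor_split(linear_cache_indices, num_passes):
        prefetch_one_pass(pass_indices)  # existing single-pass prefetch logic
```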

Benefit

With this turned on, the peak memory usage drops from `6 * input_size` to `(6 / M) * input_size`, where `M` is the configured number of passes.

Overhead
Overall, the larger the configured `M`, the slower prefetch becomes, but the overhead is acceptable:

  • Efficiency regression: Prefetch takes longer because the cache lookup is repeated for duplicate indices. Previously, indices were deduplicated before lookup; now an index may be looked up multiple times if its duplicates fall into different passes.
    • The regression is overall insignificant, as the major cost is data movement between DDR and HBM. We still copy the data only once, even if indices are duplicated across different passes.
    • The regression is likely hidden from actual training performance, since prefetch happens on a separate stream. As long as it is not long enough to block the sparse backward pass, it is invisible.
  • Spamming the CUDA launch queue: CUDA allows at most 1024 pending kernel launches, and the CPU blocks if more are submitted. If each kernel is very small, we can easily spam the launch queue and badly hurt QPS. We mitigate this by enforcing a minimum number of elements per pass, as sketched below.
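For example, the minimum-size guard could look roughly like this (a sketch under assumed names; `MIN_PASS_SIZE` and the helper are illustrative, not necessarily the constants used in the patch):

```python
MIN_PASS_SIZE = 16 * 1024  # illustrative threshold, not the exact value in the patch


def effective_num_passes(total_indices: int, requested_passes: int) -> int:
    # Cap the number of passes so that each pass still processes at least
    # MIN_PASS_SIZE elements; otherwise many tiny kernels would pile up in
    # CUDA's ~1024-entry pending-launch queue and stall the CPU.
    max_passes_by_size = max(1, total_indices // MIN_PASS_SIZE)
    return max(1, min(requested_passes, max_passes_by_size))
```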

What's in the patch?

  1. Add a multi-pass prefetch config to the TBE interface. It defaults to None for full backward compatibility (see the usage sketch after this list).
  2. Modify `lru_find_uncached` to make it idempotent: if we try to lock the same id multiple times within a single timestep (but across multiple passes), the lock counter is incremented only once.
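A hedged usage sketch of item 1: the snippet below shows how the new option might be passed to a TBE module. The config class name, its `num_passes` field, the keyword argument, and the import locations are assumptions for illustration and should be checked against the actual API added in this diff.

```python
# All multipass-related names below are assumptions for illustration.
from fbgemm_gpu.split_table_batched_embeddings_ops_common import (
    EmbeddingLocation,
    MultipassPrefetchConfig,  # assumed to be introduced by this diff
)
from fbgemm_gpu.split_table_batched_embeddings_ops_training import (
    ComputeDevice,
    SplitTableBatchedEmbeddingBagsCodegen,
)

tbe = SplitTableBatchedEmbeddingBagsCodegen(
    embedding_specs=[
        # (num_embeddings, embedding_dim, location, compute_device)
        (1_000_000, 128, EmbeddingLocation.MANAGED_CACHING, ComputeDevice.CUDA),
    ],
    # Defaults to None, which keeps the existing single-pass prefetch behavior.
    multipass_prefetch_config=MultipassPrefetchConfig(num_passes=4),
)
```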

Differential Revision: D56908989

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D56908989


netlify bot commented May 7, 2024

Deploy Preview for pytorch-fbgemm-docs failed.

🔨 Latest commit: 3386341
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/66469ed5af92290008b814cd

levythu added a commit to levythu/FBGEMM that referenced this pull request May 7, 2024

levythu added a commit to levythu/torchrec that referenced this pull request May 14, 2024
…r will correctly propagate its storage cost

Summary: After FBGEMM TBE supports multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff enables TorchRec to pass it all the way through via CacheParams, and the shard estimator will recognize the memory saving accordingly.

Differential Revision: D57055184
levythu added a commit to levythu/FBGEMM that referenced this pull request May 15, 2024

levythu added a commit to levythu/FBGEMM that referenced this pull request May 16, 2024

levythu added a commit to levythu/torchrec that referenced this pull request May 16, 2024

levythu added a commit to levythu/FBGEMM that referenced this pull request May 17, 2024

@facebook-github-bot (Contributor)

This pull request has been merged in 578ab67.

levythu added a commit to levythu/torchrec that referenced this pull request May 18, 2024
facebook-github-bot pushed a commit to pytorch/torchrec that referenced this pull request May 18, 2024
…r will correctly propagate its storage cost (#2000)

Summary:
Pull Request resolved: #2000

After FBGEMM TBE supports multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff enables TorchRec to pass it all the way through via CacheParams, and the shard estimator will recognize the memory saving accordingly.

Reviewed By: sarckk

Differential Revision: D57055184

fbshipit-source-id: 9fc2d05c30d7826f654976b1d85059fb3e9a1aae