Implement multi-pass prefetch for memory efficiency #2566
Conversation
This pull request was exported from Phabricator. Differential Revision: D56908989

❌ Deploy Preview for pytorch-fbgemm-docs failed.
Summary:

## Context

Memory snapshots show significant memory usage during the prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), estimated at 6x the input size. Unfortunately, because these kernels run on a dedicated stream, that memory cannot be reused by any other stream without a performance penalty, so we need to lower the peak prefetch memory usage as much as possible.

## MultiPass Prefetch (MPP)

Multipass prefetch trades a small amount of extra running time for lower peak memory during prefetch. We observed that the intermediate memory usage of every function in the prefetch path is `O(N)`, so we reduce the number of indices (`N`) prefetched in each pass to cut the peak temporary usage. Subsequent passes recycle the memory used by the first pass, so they do not further increase the memory footprint.

**Benefit**

With this turned on, peak memory usage drops from `6 * input_size` to `(6 / M) * input_size`, where `M` is the configured number of passes.

**Overhead**

In general, the larger `M` is, the slower prefetch becomes, but the overall overhead is acceptable:

- **Efficiency regression**: Prefetch takes longer because the cache lookup is repeated for every duplicate index. Previously, indices were deduplicated before lookup; now a duplicated index may be looked up once per pass it appears in.
  - The regression is insignificant overall, since the dominant cost is data movement between DDR and HBM, and we still copy each row only once even if its index is duplicated across passes.
  - The regression is also likely hidden from actual training performance, since prefetch runs on a separate stream: as long as it is not long enough to block the sparse backward pass, it is invisible.
- **Spamming the CUDA launch queue**: CUDA allows at most 1024 pending kernel launches, and the CPU blocks if more are submitted. If each pass's kernels are very small, we can easily saturate the launch queue and significantly hurt QPS. We mitigate this by enforcing a minimum number of elements per pass.

## What's in the patch?

1. Add a multipass prefetch config to the TBE interface. It defaults to `None` for full backward compatibility.
2. Make `lru_find_uncached` idempotent: if the same id is locked multiple times within a single timestep (but across multiple passes), the lock counter is incremented only once.

Differential Revision: D56908989
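The pass-splitting logic described above can be sketched as follows. This is an illustrative Python sketch, not FBGEMM's actual implementation; the names `make_multipass_plan` and `MIN_ELEMENTS_PER_PASS` are hypothetical, and the minimum-elements threshold is an assumed value. It shows the two constraints the patch balances: capping peak memory by splitting the index range into `M` chunks, while never letting a pass shrink below a minimum size (which would spam the CUDA launch queue with tiny kernels).

```python
# Hypothetical sketch of multipass prefetch planning.
# MIN_ELEMENTS_PER_PASS is an assumed threshold, not FBGEMM's real constant.
MIN_ELEMENTS_PER_PASS = 4096


def make_multipass_plan(num_indices: int, num_passes: int) -> list[tuple[int, int]]:
    """Split [0, num_indices) into at most `num_passes` contiguous chunks.

    The pass count is capped so that no pass falls below
    MIN_ELEMENTS_PER_PASS elements (unless the whole input is smaller
    than that, in which case a single pass is used).
    """
    # Cap the pass count: more passes than num_indices / MIN_ELEMENTS_PER_PASS
    # would produce undersized passes and flood the CUDA launch queue.
    effective_passes = max(1, min(num_passes, num_indices // MIN_ELEMENTS_PER_PASS or 1))
    chunk = -(-num_indices // effective_passes)  # ceiling division
    plan = []
    start = 0
    while start < num_indices:
        end = min(start + chunk, num_indices)
        plan.append((start, end))  # each pass prefetches indices[start:end]
        start = end
    return plan
```

With a plan like this, peak temporary memory scales with the largest chunk rather than the full input, giving roughly the `(6 / M) * input_size` peak described above, while small inputs degenerate to a single pass.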
…r will correctly propagate its storage cost Summary: Now that FBGEMM TBE supports multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff enables TorchRec to pass the config through via CacheParams, and the shard estimator will account for the memory saving accordingly. Differential Revision: D57055184
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989 |
…r will correctly propagate its storage cost (pytorch#2000) Summary: After FBGEMM TBE support multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff will enable TorchRec to pass it all through via CacheParams, and shard estimator will recognize the memory saving accordingly. Differential Revision: D57055184
Summary: ## Context Memory snapshot shows significant memory usage during prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), which is estimated to be 6x of input size And unfortunately, due to they using dedicated stream, the memory cannot be reused by any other stream without performance penalty. So we need to lower down the peak prefetch memory usage as much as possible. ## MultiPass Prefetch (MPP) Multipass prefetch is basically a technique to sacrifice a bit of more running time for less peak memory during prefetch: We observed that intermediate memory usage for all functions during prefetch is `O(N)`, so we reduce the total prefetched index (`N`) for each pass to reduce the peak temporary usage. The following passes will recycle the memory used in the first pass so they won't further increase the memory footprint. **Benefit** With this being turned on, the peak memory usage will be dropped from `6 * input_size` to `(6 / M) * input_size`, where `M` is the total # of passes being configured. **Overhead** Overall, the bigger `M` we configured, the slower we'll be. But the overall overhead is acceptable. - **Efficiency regression**: Prefetch is taking longer because the process of cache lookup is being repeated for every duplicate index. In the past, they're deduped before being looked up, but now they might be look up multiple times if duplicate index are across different passes. - The regression is overall insignificant, as the major cost is the data movement between DDR and HBM. We'll always copy the data only once, even if they're duplicated across different passes. - The regression is likely hidden from the actual training performance, since prefetch happen in a separate stream. As long as it's not long enough to block sparse backward it's invisible. - **Spamming CUDA Launch Queue**: CUDA is allowing max # of 1024 pending kernels. CPU will go blocking if more are submitted. 
If a kernel is really small, we'll easily spam launch queue and greatly hurt QPS. We mitigate this via limit the minimal # of elements for a pass. ## What's in the patch? 1. Add multipass prefetch config to the interface of TBE. By default it's None for full backward compatibility 2. Modify the `lru_find_uncached` to make it idempotent -- if we tried to lock the same id multiple times in one single timestep (but multiple passes), we'll increase lock counter by only one. Reviewed By: sryap Differential Revision: D56908989
This pull request was exported from Phabricator. Differential Revision: D56908989
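The idempotent locking requirement from item 2 of the summary can be sketched as a small lock table. This is a hypothetical model of the invariant, not `lru_find_uncached` itself: the class and field names are illustrative.

```python
class CacheLockTable:
    """Per-slot lock counts that are idempotent within one timestep."""

    def __init__(self):
        self.lock_count = {}      # slot id -> outstanding lock count
        self.last_locked_at = {}  # slot id -> timestep of last lock

    def lock(self, slot, timestep):
        # Idempotent within a timestep: repeated passes hitting the same
        # slot in one prefetch round bump the counter only once.
        if self.last_locked_at.get(slot) == timestep:
            return
        self.last_locked_at[slot] = timestep
        self.lock_count[slot] = self.lock_count.get(slot, 0) + 1

    def unlock(self, slot):
        self.lock_count[slot] -= 1
```

Without the timestep check, duplicate indices split across passes would over-count locks and the slot could never be evicted again.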
This pull request has been merged in 578ab67.
…r will correctly propagate its storage cost (#2000) Summary: Pull Request resolved: #2000 After FBGEMM TBE supports multipass prefetch mode (see pytorch/FBGEMM#2566 for the full context), this diff enables TorchRec to pass it all the way through via CacheParams, and the shard estimator will recognize the memory saving accordingly. Reviewed By: sarckk Differential Revision: D57055184 fbshipit-source-id: 9fc2d05c30d7826f654976b1d85059fb3e9a1aae
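The estimator-side accounting described in that commit can be sketched as follows. The field and function names here are illustrative assumptions, not the actual TorchRec/FBGEMM definitions: the point is only that a pass-count config lets the estimator divide the prefetch-memory term by `M`.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultiPassPrefetchConfig:
    """Hypothetical shape of the config threaded through CacheParams."""
    num_passes: int = 2        # M in the (6 / M) * input_size bound
    min_pass_size: int = 1024  # guard against tiny kernel launches

def prefetch_temp_bytes(input_bytes: int,
                        mpp: Optional[MultiPassPrefetchConfig]) -> int:
    # Without MPP, prefetch temporaries are ~6x the input size; with M
    # passes the peak drops to roughly 6/M of that (rounded up).
    if mpp is None:
        return 6 * input_bytes
    return -(-6 * input_bytes // mpp.num_passes)  # ceil division
```

A `None` config preserves the old `6x` estimate, matching the backward-compatible default on the TBE interface.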