Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug in AsyncAllocator memory pools access setting. #67373

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

buptzyb
Copy link
Contributor

@buptzyb buptzyb commented May 11, 2024

The original code uses the wrong parameter i for cuDeviceCanAccessPeer(), which can cause undefined behavior when (*all_ids_)[i] != platform_device_id, for example, virtual devices are used. Should use (*all_ids_)[i].value() to represent the previous pool id.

Also, add a judgment to skip the pool initialization process if it's already initialized previously.

@google-ml-butler google-ml-butler bot added the size:S CL Change Size: Small label May 11, 2024
@gbaned gbaned added the comp:xla XLA label May 14, 2024
@gbaned gbaned added this to Assigned Reviewer in PR Queue via automation May 14, 2024
@gbaned gbaned requested a review from cheshire May 14, 2024 05:40
@google-ml-butler google-ml-butler bot added the awaiting review Pull request awaiting review label May 14, 2024
@changhuilin changhuilin self-requested a review May 23, 2024 22:14
@@ -215,6 +215,15 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
static auto* all_pools_ = new std::vector<CUmemoryPool*>();
static auto* all_ids_ = new std::vector<tsl::PlatformDeviceId>();
DCHECK(all_pools_->size() == all_ids_->size());
for (auto& pool_item_ : *all_pools_) {
if (*pool_item_ == pool_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to check my understanding - each device has a default memory pool. Virtual devices on the same physical device will use the same memory pool. So for all virtual devices on the same physical device, we only initialize the pool of this physical device once. In other words, if the PlatformDeviceId already appears in the all_ids_, the pool initialization can be skipped. Right?

Can you add some comment for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your understanding is correct. Add comments ahead.

@@ -246,10 +255,11 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
// Set the previous pools access to the current GPU.
map.location.id = platform_device_id.value();

VLOG(2) << "Set access to the pool id: " << i
VLOG(2) << "Set access to the pool id: " << (*all_ids_)[i].value()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a temp var to save (*all_ids_)[i].value(), and use it below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a var int previous_pool_id = (*all_ids_)[i].value();.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting review Pull request awaiting review comp:xla XLA size:S CL Change Size: Small
Projects
PR Queue
  
Assigned Reviewer
Development

Successfully merging this pull request may close these issues.

None yet

3 participants