New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix a bug in AsyncAllocator memory pools access setting. #67373
base: master
Are you sure you want to change the base?
Conversation
@@ -215,6 +215,15 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator( | |||
static auto* all_pools_ = new std::vector<CUmemoryPool*>(); | |||
static auto* all_ids_ = new std::vector<tsl::PlatformDeviceId>(); | |||
DCHECK(all_pools_->size() == all_ids_->size()); | |||
for (auto& pool_item_ : *all_pools_) { | |||
if (*pool_item_ == pool_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to check my understanding - each device has a default memory pool. Virtual devices on the same physical device will use the same memory pool. So for all virtual devices on the same physical device, we only initialize the pool of this physical device once. In other words, if the PlatformDeviceId
already appears in the all_ids_
, the pool initialization can be skipped. Right?
Can you add some comment for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, your understanding is correct. Add comments ahead.
@@ -246,10 +255,11 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator( | |||
// Set the previous pools access to the current GPU. | |||
map.location.id = platform_device_id.value(); | |||
|
|||
VLOG(2) << "Set access to the pool id: " << i | |||
VLOG(2) << "Set access to the pool id: " << (*all_ids_)[i].value() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a temp var to save (*all_ids_)[i].value()
, and use it below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create a var int previous_pool_id = (*all_ids_)[i].value();
.
The original code uses the wrong parameter
i
for cuDeviceCanAccessPeer(), which can cause undefined behavior when(*all_ids_)[i] != platform_device_id
, for example, virtual devices are used. Should use(*all_ids_)[i].value()
to represent the previous pool id.Also, add a judgment to skip the pool initialization process if it's already initialized previously.