Fix a bug in AsyncAllocator memory pools access setting. #67373

buptzyb · 2024-05-11T03:42:48Z

The original code passes the index i to cuDeviceCanAccessPeer() as the peer device, which can cause undefined behavior when (*all_ids_)[i] != platform_device_id, for example when virtual devices are used. It should pass (*all_ids_)[i].value(), which is the id of the previous pool.

Also, add a check that skips pool initialization if the pool has already been initialized.
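To illustrate the first point, here is a minimal sketch of why the position i in the id list and the device id stored at that position are not interchangeable. Everything in it is hypothetical: `CanAccessPeer` is a plain stand-in for a peer-access query such as cuDeviceCanAccessPeer(), the ids are plain ints rather than tsl::PlatformDeviceId, and the function names are made up for the example. It is not the allocator's code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a peer-access query. In this toy topology only
// devices 0 and 2 can access each other.
bool CanAccessPeer(int device, int peer_device) {
  return (device == 0 && peer_device == 2) ||
         (device == 2 && peer_device == 0);
}

// Returns the device ids of previously registered pools that
// `current_device` can access. `registered_ids[i]` is the device id that
// owns pool i; it does not have to equal i.
std::vector<int> AccessiblePreviousPools(const std::vector<int>& registered_ids,
                                         int current_device) {
  std::vector<int> result;
  for (std::size_t i = 0; i < registered_ids.size(); ++i) {
    // Query with the stored device id, not with the loop index.
    const int previous_pool_id = registered_ids[i];
    if (CanAccessPeer(current_device, previous_pool_id)) {
      result.push_back(previous_pool_id);
    }
  }
  return result;
}

// Variant kept only for contrast: it queries with the loop index i, so it
// asks about the wrong device whenever registered_ids[i] != i.
std::vector<int> AccessiblePreviousPoolsByIndex(
    const std::vector<int>& registered_ids, int current_device) {
  std::vector<int> result;
  for (std::size_t i = 0; i < registered_ids.size(); ++i) {
    if (CanAccessPeer(current_device, static_cast<int>(i))) {
      result.push_back(registered_ids[i]);
    }
  }
  return result;
}
```

With registered ids {1, 2} and current device 0, the id-based version finds the pool on device 2, while the index-based version queries devices 0 and 1 and finds nothing.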

changhuilin · 2024-05-23T22:31:46Z

third_party/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc

@@ -215,6 +215,15 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
  static auto* all_pools_ = new std::vector<CUmemoryPool*>();
  static auto* all_ids_ = new std::vector<tsl::PlatformDeviceId>();
  DCHECK(all_pools_->size() == all_ids_->size());
+  for (auto& pool_item_ : *all_pools_) {
+    if (*pool_item_ == pool_) {


Just to check my understanding: each device has a default memory pool, and virtual devices on the same physical device use the same memory pool. So for all virtual devices on the same physical device, we only initialize the pool of this physical device once. In other words, if the PlatformDeviceId already appears in all_ids_, the pool initialization can be skipped. Right?

Can you add a comment for this?

buptzyb · 2024-05-24

Yes, your understanding is correct. I added comments ahead of the check.
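The skip-if-already-initialized idea discussed in this thread can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in the PR: `PoolHandle` is a hypothetical stand-in for CUmemoryPool, pools are stored by value rather than as pointers, and `PoolRegistry` and `RegisterIfNew` are names invented for the example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a memory pool handle (CUmemoryPool in the real
// allocator).
using PoolHandle = int;

struct PoolRegistry {
  std::vector<PoolHandle> pools;  // one entry per initialized pool
  std::vector<int> device_ids;    // device_ids[i] owns pools[i]

  // Returns true if `pool` was not seen before and has now been recorded,
  // meaning the caller should run the one-time initialization. Returns
  // false if the pool is already registered, for example because another
  // virtual device on the same physical device registered it earlier.
  bool RegisterIfNew(PoolHandle pool, int device_id) {
    // The two vectors are kept in lockstep.
    assert(pools.size() == device_ids.size());
    for (const PoolHandle& existing : pools) {
      if (existing == pool) return false;  // Already initialized: skip.
    }
    pools.push_back(pool);
    device_ids.push_back(device_id);
    return true;
  }
};
```

A caller would run the access-setting loop only when `RegisterIfNew` returns true, so two virtual devices that share one physical device's default pool trigger the setup once.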

changhuilin · 2024-05-23T22:33:38Z

third_party/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc

@@ -246,10 +255,11 @@ GpuCudaMallocAsyncAllocator::GpuCudaMallocAsyncAllocator(
    // Set the previous pools access to the current GPU.
    map.location.id = platform_device_id.value();

-    VLOG(2) << "Set access to the pool id: " << i
+    VLOG(2) << "Set access to the pool id: " << (*all_ids_)[i].value()


Could you use a temp var to save (*all_ids_)[i].value(), and use it below?

buptzyb · 2024-05-24

Created a variable: int previous_pool_id = (*all_ids_)[i].value();

bidirectional memory pools access

f5dd5ec

google-ml-butler bot added the size:S CL Change Size: Small label May 11, 2024

google-ml-butler bot assigned gbaned May 11, 2024

gbaned added the comp:xla XLA label May 14, 2024

gbaned added this to Assigned Reviewer in PR Queue via automation May 14, 2024

gbaned requested a review from cheshire May 14, 2024 05:40

google-ml-butler bot added the awaiting review Pull request awaiting review label May 14, 2024

changhuilin self-requested a review May 23, 2024 22:14

changhuilin reviewed May 23, 2024

Add comments and use previous_pool_id var

0318323

buptzyb commented May 11, 2024

changhuilin May 23, 2024

buptzyb May 24, 2024

changhuilin May 23, 2024

buptzyb May 24, 2024
