[WIP] Persistent CAGRA kernel #2316

Draft
achirkin wants to merge 40 commits into branch-24.06
Conversation

@achirkin (Contributor) commented on May 14, 2024

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Benchmark plots: CAGRA throughput @ recall = 0.976 and CAGRA latency @ recall = 0.976]

API

In the current implementation, the public API does not change. An extra parameter, persistent, is added to ann::cagra::search_params (it is only valid when algo == SINGLE_CTA).
The persistent kernel is managed by a global runner object held in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat for the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
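
A minimal usage sketch (assuming the raft::neighbors::cagra API on branch-24.06; the persistent field is the new parameter introduced by this PR, everything else is the existing search API):

```cpp
#include <raft/neighbors/cagra.cuh>

void search_persistent(raft::resources const& res,
                       raft::neighbors::cagra::index<float, uint32_t> const& index,
                       raft::device_matrix_view<const float, int64_t> queries,
                       raft::device_matrix_view<uint32_t, int64_t> neighbors,
                       raft::device_matrix_view<float, int64_t> distances)
{
  raft::neighbors::cagra::search_params params;
  // The persistent mode is only valid with the single-CTA algorithm.
  params.algo       = raft::neighbors::cagra::search_algo::SINGLE_CTA;
  params.persistent = true;  // the new flag added by this PR

  // The first caller spawns the persistent runner; subsequent calls from any
  // CPU thread just submit work and refresh the runner's heartbeat.
  raft::neighbors::cagra::search(res, params, index, queries, neighbors, distances);
}
```

Many CPU threads can call this concurrently with small query batches; they all share the single persistent kernel.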

An alternative solution would be to control the kernel explicitly, in a client-server style. This would give more control, but would require a significant rethinking of the RAFT/cuVS API.

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally used buffers). They make at least a couple of cudaGetDevice calls during initialization. Normally this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in the CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::uvector: it does not check during resize/destruction whether the current device has changed since construction.
We may consider moving it to a common folder to use across other RAFT/cuVS algorithms.
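
A condensed sketch of the idea (not the PR's exact code; the PR version plugs into RMM resources, while this sketch uses the stream-ordered cudaMallocAsync allocator for brevity):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// A scratch device buffer that performs no CUDA calls while it is empty and
// never re-queries or switches the current device after construction.
template <typename T>
struct lightweight_uvector {
  explicit lightweight_uvector(cudaStream_t stream) noexcept : stream_{stream} {}

  // NB: contents are not preserved across resizes (scratch-buffer use case).
  void resize(std::size_t new_size)
  {
    if (new_size == size_) { return; }
    T* new_ptr = nullptr;
    if (new_size > 0) { cudaMallocAsync(&new_ptr, new_size * sizeof(T), stream_); }
    if (data_ != nullptr) { cudaFreeAsync(data_, stream_); }
    data_ = new_ptr;
    size_ = new_size;
  }

  // NB: "multi-device-unsafe": assumes the current device is still the one
  // that was active at construction time.
  ~lightweight_uvector()
  {
    if (data_ != nullptr) { cudaFreeAsync(data_, stream_); }
  }

  T* data() noexcept { return data_; }
  std::size_t size() const noexcept { return size_; }

 private:
  cudaStream_t stream_{};
  T* data_{nullptr};
  std::size_t size_{0};
};
```

The point is the zero-size path: a conditionally used buffer costs no CUDA calls to construct or destroy, so hundreds of CPU threads can build search plans without serializing on the driver mutex.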

Shared resource queues / ring buffers

resource_queue_t is an atomic-counter-based ring buffer used to distribute the worker resources (CTAs) and the pre-allocated job descriptors across CPU I/O threads (see the sketch below).
We may consider moving it to a common public namespace in raft if we envision more uses for it.
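
A sketch of the mechanism (a ticket-based bounded MPMC queue; the actual resource_queue_t differs in the details):

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <thread>

// Hands out reusable resource handles (e.g. worker CTA slots or job
// descriptor indices) among many CPU threads via two atomic ticket counters.
template <typename T, uint32_t kCapacity>  // kCapacity must be a power of two
class resource_queue {
  struct cell {
    std::atomic<uint32_t> seq;
    T value;
  };

 public:
  resource_queue()
  {
    for (uint32_t i = 0; i < kCapacity; ++i) { cells_[i].seq.store(i); }
  }

  // Return a resource to the queue; never blocks for long, because at most
  // kCapacity resources exist in total.
  void push(T v)
  {
    uint32_t t = tail_.fetch_add(1, std::memory_order_relaxed);
    cell& c    = cells_[t & (kCapacity - 1)];
    while (c.seq.load(std::memory_order_acquire) != t) { std::this_thread::yield(); }
    c.value = v;
    c.seq.store(t + 1, std::memory_order_release);
  }

  // Take a resource, spinning until one becomes available.
  T pop()
  {
    uint32_t h = head_.fetch_add(1, std::memory_order_relaxed);
    cell& c    = cells_[h & (kCapacity - 1)];
    while (c.seq.load(std::memory_order_acquire) != h + 1) { std::this_thread::yield(); }
    T v = c.value;
    c.seq.store(h + kCapacity, std::memory_order_release);
    return v;
  }

 private:
  std::atomic<uint32_t> head_{0}, tail_{0};
  std::array<cell, kCapacity> cells_{};
};
```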

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted away from the CAGRA kernel and reused in other algorithms. The code in its current state, however, is not ready for this.

Other code changes (solved)

This depends on (includes all the changes from):

The host fills in the work_descriptors, which live in pinned memory, and then arrives at the
input barriers (managed memory, on device) to mark that the descriptors are ready to be read.
Then it waits on the completion latch (managed memory, on host).

The device reads the descriptors when the readiness barrier allows it.
The descriptors are read by multiple threads at the same time (hoping for a single coalesced read); a simplified sketch of this handshake follows.
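
A simplified sketch of the protocol (not the PR code; system-scope libcu++ atomics stand in for the PR's barriers and latch, and the names are illustrative):

```cuda
#include <cuda/atomic>
#include <cstdint>

struct work_descriptor {  // lives in pinned host memory
  const float* queries;
  uint32_t n_queries;
};

struct job_slot {
  work_descriptor desc;                                  // pinned host memory
  cuda::atomic<int, cuda::thread_scope_system>* ready;   // host -> device signal
  cuda::atomic<int, cuda::thread_scope_system>* done;    // device -> host signal
};

// Host side: publish a job and wait for its completion.
inline void submit(job_slot& slot, work_descriptor const& d)
{
  slot.desc = d;                                     // fill the pinned descriptor
  slot.ready->store(1, cuda::memory_order_release);  // "arrive" at the input barrier
  while (slot.done->load(cuda::memory_order_acquire) == 0) {}  // completion latch
  slot.done->store(0, cuda::memory_order_relaxed);
}

// Device side (one CTA of the persistent kernel): wait, read, process, signal.
__device__ void process(job_slot& slot)
{
  if (threadIdx.x == 0) {
    while (slot.ready->load(cuda::memory_order_acquire) == 0) {}
  }
  __syncthreads();
  // All threads of the CTA read the descriptor at the same time
  // (hoping for a single coalesced read).
  work_descriptor d = slot.desc;
  // ... run the single-CTA search for d, writing results to device memory ...
  __syncthreads();
  if (threadIdx.x == 0) {
    slot.ready->store(0, cuda::memory_order_relaxed);
    slot.done->store(1, cuda::memory_order_release);  // release the waiting host thread
  }
}
```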
Minimize the host<->device latencies by using host pinned memory and device memory for intra-device communication.
Make the benchmark loop event sync optional

When using the persistent kernel variant, the calling CPU thread has to synchronize
with the GPU (wait on the completion flag), i.e. there's no way to use events for this.
As a result, the event recording and sync in the benchmark loop introduce significant latency overheads.
To avoid this, I make the event optional (dependent on the search mode: persistent/original).

Originally, the benchmark used size_t indices, whereas CAGRA operates on uint32_t.
As a result, we had to do a linear mapping (on the GPU), which adds a kernel to the benchmark loop
and goes against the event optimization above.
Hence, I changed the benchmark index type.
Restructure the input/output a bit to pad the atomics to 128 bytes (see the sketch below).
This reduces the latency / single-threaded time by 3x on a PCIe machine.
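
The padding idea in one snippet (illustrative, not the PR's exact structs): each host<->device flag gets its own 128-byte line, so polling one flag does not thrash the line holding its neighbors.

```cpp
#include <cuda/atomic>

// 128-byte alignment keeps each flag on its own cache line(s),
// avoiding false sharing between concurrently polling threads.
struct alignas(128) padded_flag {
  cuda::atomic<int, cuda::thread_scope_system> value{0};
  // alignas(128) rounds sizeof up, so the rest of the struct is padding
};

static_assert(sizeof(padded_flag) == 128, "each flag must occupy a full line");
```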
1) Make the persistent kernel allocate the hashmap in advance.
2) Introduce lightweight_uvector, which does not call any CUDA functions when not needed.
Since the worker and job queues were decoupled, it is no longer necessary to wait for the job to be
read. As soon as the descriptor handle is read, it can be returned to the queue.
@achirkin added the labels feature request (New feature or request), non-breaking (Non-breaking change), and 2 - In Progress (Currently a work in progress) on May 14, 2024
@achirkin self-assigned this on May 14, 2024
@github-actions bot removed the CMake label on May 15, 2024
@cjnolet added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on May 17, 2024