[WIP] Persistent CAGRA kernel #2316

Draft
achirkin wants to merge 40 commits into branch-24.06
Conversation

@achirkin (Contributor) commented on May 14, 2024

An experimental version of the single-CTA CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.

[Benchmark plots: CAGRA throughput @ recall = 0.976 and CAGRA latency @ recall = 0.976]

API

In the current implementation, the public API does not change. An extra parameter, persistent, is added to ann::cagra::search_params (it is only valid when algo == SINGLE_CTA).
The persistent kernel is managed by a global runner object held in a shared_ptr; the first CPU thread to call the kernel spawns the runner, and subsequent calls/threads only update a global "heartbeat" atomic variable (runner_base_t::last_touch). When there has been no heartbeat for the last few seconds (kLiveInterval), the runner shuts down the kernel and cleans up the associated resources.
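
A minimal usage sketch (assuming the raft::neighbors::cagra API on branch-24.06; the persistent field is the new parameter introduced by this PR, everything else is the existing search API):

```cpp
#include <raft/neighbors/cagra.cuh>

void search_persistent(raft::resources const& res,
                       raft::neighbors::cagra::index<float, uint32_t> const& index,
                       raft::device_matrix_view<const float, int64_t> queries,
                       raft::device_matrix_view<uint32_t, int64_t> neighbors,
                       raft::device_matrix_view<float, int64_t> distances)
{
  raft::neighbors::cagra::search_params params;
  // The persistent mode is only valid with the single-CTA algorithm.
  params.algo       = raft::neighbors::cagra::search_algo::SINGLE_CTA;
  params.persistent = true;  // the new flag added by this PR

  // The first caller spawns the persistent runner; subsequent calls from any
  // CPU thread just submit work and refresh the runner's heartbeat.
  raft::neighbors::cagra::search(res, params, index, queries, neighbors, distances);
}
```

Many CPU threads can call this concurrently with small query batches; they all share the single persistent kernel.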

An alternative solution would be to control the kernel explicitly, in a client-server style. This would give more control, but would require a significant rethinking of the RAFT/cuVS API.

Integration notes

lightweight_uvector

RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally used buffers). They make at least a couple of cudaGetDevice calls during initialization. Normally this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck, because a single mutex guards a critical section somewhere in the driver.

To work around this, I introduce a lightweight_uvector in /detail/cagra/search_plan.cuh for several buffers used in the CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of rmm::uvector: it does not check during resize/destruction whether the current device has changed since construction.
We may consider moving it to a common folder to use across other RAFT/cuVS algorithms.
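
A condensed sketch of the idea (not the PR's exact code; the PR version plugs into RMM resources, while this sketch uses the stream-ordered cudaMallocAsync allocator for brevity):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// A scratch device buffer that performs no CUDA calls while it is empty and
// never re-queries or switches the current device after construction.
template <typename T>
struct lightweight_uvector {
  explicit lightweight_uvector(cudaStream_t stream) noexcept : stream_{stream} {}

  // NB: contents are not preserved across resizes (scratch-buffer use case).
  void resize(std::size_t new_size)
  {
    if (new_size == size_) { return; }
    T* new_ptr = nullptr;
    if (new_size > 0) { cudaMallocAsync(&new_ptr, new_size * sizeof(T), stream_); }
    if (data_ != nullptr) { cudaFreeAsync(data_, stream_); }
    data_ = new_ptr;
    size_ = new_size;
  }

  // NB: "multi-device-unsafe": assumes the current device is still the one
  // that was active at construction time.
  ~lightweight_uvector()
  {
    if (data_ != nullptr) { cudaFreeAsync(data_, stream_); }
  }

  T* data() noexcept { return data_; }
  std::size_t size() const noexcept { return size_; }

 private:
  cudaStream_t stream_{};
  T* data_{nullptr};
  std::size_t size_{0};
};
```

The point is the zero-size path: a conditionally used buffer costs no CUDA calls to construct or destroy, so hundreds of CPU threads can build search plans without serializing on the driver mutex.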

Shared resource queues / ring buffers

resource_queue_t is an atomic-counter-based ring buffer used to distribute the worker resources (CTAs) and the pre-allocated job descriptors across CPU I/O threads (see the sketch below).
We may consider moving it to a common public namespace in raft if we envision more uses for it.
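
A sketch of the mechanism (a ticket-based bounded MPMC queue; the actual resource_queue_t differs in the details):

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <thread>

// Hands out reusable resource handles (e.g. worker CTA slots or job
// descriptor indices) among many CPU threads via two atomic ticket counters.
template <typename T, uint32_t kCapacity>  // kCapacity must be a power of two
class resource_queue {
  struct cell {
    std::atomic<uint32_t> seq;
    T value;
  };

 public:
  resource_queue()
  {
    for (uint32_t i = 0; i < kCapacity; ++i) { cells_[i].seq.store(i); }
  }

  // Return a resource to the queue; never blocks for long, because at most
  // kCapacity resources exist in total.
  void push(T v)
  {
    uint32_t t = tail_.fetch_add(1, std::memory_order_relaxed);
    cell& c    = cells_[t & (kCapacity - 1)];
    while (c.seq.load(std::memory_order_acquire) != t) { std::this_thread::yield(); }
    c.value = v;
    c.seq.store(t + 1, std::memory_order_release);
  }

  // Take a resource, spinning until one becomes available.
  T pop()
  {
    uint32_t h = head_.fetch_add(1, std::memory_order_relaxed);
    cell& c    = cells_[h & (kCapacity - 1)];
    while (c.seq.load(std::memory_order_acquire) != h + 1) { std::this_thread::yield(); }
    T v = c.value;
    c.seq.store(h + kCapacity, std::memory_order_release);
    return v;
  }

 private:
  std::atomic<uint32_t> head_{0}, tail_{0};
  std::array<cell, kCapacity> cells_{};
};
```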

Persistent runner structs

launcher_t and persistent_runner_base_t look like they could be abstracted away from the CAGRA kernel and reused in other algorithms. The code in its current state, however, is not ready for this.

Other code changes (solved)

This depends on (includes all the changes from):

The host fills in the work_descriptors, which live in pinned memory, and then arrives at the
input barriers (managed memory, on device) to mark that the descriptors are ready to be read.
Then it waits on the completion latch (managed memory, on host).

The device reads the descriptors when the readiness barrier allows it.
The descriptors are read by multiple threads at the same time (hoping for a single coalesced read); a simplified sketch of this handshake follows.
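
A simplified sketch of the protocol (not the PR code; system-scope libcu++ atomics stand in for the PR's barriers and latch, and the names are illustrative):

```cuda
#include <cuda/atomic>
#include <cstdint>

struct work_descriptor {  // lives in pinned host memory
  const float* queries;
  uint32_t n_queries;
};

struct job_slot {
  work_descriptor desc;                                  // pinned host memory
  cuda::atomic<int, cuda::thread_scope_system>* ready;   // host -> device signal
  cuda::atomic<int, cuda::thread_scope_system>* done;    // device -> host signal
};

// Host side: publish a job and wait for its completion.
inline void submit(job_slot& slot, work_descriptor const& d)
{
  slot.desc = d;                                     // fill the pinned descriptor
  slot.ready->store(1, cuda::memory_order_release);  // "arrive" at the input barrier
  while (slot.done->load(cuda::memory_order_acquire) == 0) {}  // completion latch
  slot.done->store(0, cuda::memory_order_relaxed);
}

// Device side (one CTA of the persistent kernel): wait, read, process, signal.
__device__ void process(job_slot& slot)
{
  if (threadIdx.x == 0) {
    while (slot.ready->load(cuda::memory_order_acquire) == 0) {}
  }
  __syncthreads();
  // All threads of the CTA read the descriptor at the same time
  // (hoping for a single coalesced read).
  work_descriptor d = slot.desc;
  // ... run the single-CTA search for d, writing results to device memory ...
  __syncthreads();
  if (threadIdx.x == 0) {
    slot.ready->store(0, cuda::memory_order_relaxed);
    slot.done->store(1, cuda::memory_order_release);  // release the waiting host thread
  }
}
```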
Minimize the host<->device latencies by using host pinned memory and device memory for intra-device communication.
Make the benchmark loop event sync optional

When using the persistent kernel variant, the calling CPU thread has to synchronize
with the GPU (wait on the completion flag), i.e. there's no way to use events for this.
As a result, the event recording and sync in the benchmark loop introduce significant latency overheads.
To avoid this, I make the event optional (dependent on the search mode: persistent/original).

Originally, the benchmark used size_t indices, whereas CAGRA operates on uint32_t.
As a result, we had to do a linear mapping (on the GPU), which adds a kernel to the benchmark loop
and goes against the event optimization above.
Hence, I changed the benchmark index type.
Restructure the input/output a bit to pad the atomics to 128 bytes (see the sketch below).
This reduces the latency / single-threaded time by 3x on a PCIe machine.
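
The padding idea in one snippet (illustrative, not the PR's exact structs): each host<->device flag gets its own 128-byte line, so polling one flag does not thrash the line holding its neighbors.

```cpp
#include <cuda/atomic>

// 128-byte alignment keeps each flag on its own cache line(s),
// avoiding false sharing between concurrently polling threads.
struct alignas(128) padded_flag {
  cuda::atomic<int, cuda::thread_scope_system> value{0};
  // alignas(128) rounds sizeof up, so the rest of the struct is padding
};

static_assert(sizeof(padded_flag) == 128, "each flag must occupy a full line");
```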
1) Make the persistent kernel allocate the hashmap in advance.
2) Introduce lightweight_uvector, which does not call any CUDA functions when not needed.
Since the worker and job queues were decoupled, it is no longer necessary to wait for the job to be
read. As soon as the descriptor handle is read, it can be returned to the queue.
@achirkin added the labels feature request (New feature or request), non-breaking (Non-breaking change), and 2 - In Progress (Currently a work in progress) on May 14, 2024
@achirkin self-assigned this on May 14, 2024
@github-actions bot removed the CMake label on May 15, 2024
@cjnolet added the 5 - DO NOT MERGE (Hold off on merging; see PR for details) label on May 17, 2024