[WIP] Persistent CAGRA kernel #2316
Draft: achirkin wants to merge 40 commits into rapidsai:branch-24.06 from achirkin:fea-persistent-cagra
+1,154 −98
Conversation
The host fills in the work_descriptors, which are in pinned memory, and then arrives at the input barriers (managed memory, on device) to mark that the descriptors are ready to read. Then it waits on the completion latch (managed memory, on host). The device reads the descriptors when the readiness barrier allows that; the descriptors are read by multiple threads at the same time (hoping for a single coalesced read).
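The handshake above can be modeled on the host with plain atomics. This is a hedged sketch only: the real code uses pinned/managed memory with device-side barriers and a latch, and the names `work_descriptor_t`, `submit`, and `worker_step` are illustrative, not the PR's API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Host-side model of one work descriptor slot (pinned memory in the real code).
struct work_descriptor_t {
  const float* queries = nullptr;   // input batch
  uint32_t n_queries   = 0;
  std::atomic<bool> ready{false};      // stands in for the input barrier arrival
  std::atomic<bool> completed{false};  // stands in for the completion latch
};

// CPU side: fill the descriptor, signal readiness, then wait for completion.
inline void submit(work_descriptor_t& d, const float* q, uint32_t n) {
  d.queries   = q;
  d.n_queries = n;
  d.ready.store(true, std::memory_order_release);  // "arrive" at the input barrier
  while (!d.completed.load(std::memory_order_acquire)) {
    std::this_thread::yield();  // the real code waits on a managed-memory latch
  }
}

// Device side (modeled on the host): wait until ready, consume, mark complete.
// In the kernel, many threads read the descriptor cooperatively (one coalesced read).
inline uint32_t worker_step(work_descriptor_t& d) {
  while (!d.ready.load(std::memory_order_acquire)) { std::this_thread::yield(); }
  uint32_t n = d.n_queries;
  d.completed.store(true, std::memory_order_release);
  return n;
}
```

The release/acquire pairs mirror the ordering that the barrier arrival and latch provide in the real implementation: the descriptor contents are fully written before the reader can observe `ready`.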
Minimize the host<->device latencies by using host pinned memory and device memory for intra-device comm
…ueue submitting with worker releasing
…safety related to the runner.
…mark loop event sync optional. When using the persistent kernel variant, the calling CPU thread has to synchronize with the GPU (wait on the completion flag), i.e. there's no way to use events for this. As a result, the event recording and sync in the benchmark loop introduce significant latency overheads. To avoid this, I make the event optional (dependent on the search mode: persistent/original). Originally, the benchmark used size_t indices, whereas CAGRA operated with uint32_t. As a result, we had to do a linear mapping (on GPU), which adds a kernel to the benchmark loop and goes against the event optimization above. Hence, I changed the benchmark index type.
Restructure input/output a bit to pad the atomics to 128 bytes. This reduces the latency/single-threaded time by 3x on a PCIe machine.
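The padding trick can be sketched in a few lines; the struct name here is illustrative. Aligning each flag to 128 bytes gives every atomic its own cache-line-sized block, so polling one flag from the CPU never invalidates or re-transfers a neighbouring flag over PCIe.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Each flag occupies its own 128-byte block; alignas pads the struct out.
struct alignas(128) padded_atomic_t {
  std::atomic<uint32_t> value{0};
};

static_assert(sizeof(padded_atomic_t) == 128, "one flag per 128-byte block");
static_assert(alignof(padded_atomic_t) == 128, "flags start on 128-byte boundaries");
```

Without the padding, adjacent flags polled by different CPU threads would share a line, and every update would force extra traffic (false sharing), which is what the 3x latency reduction above comes from.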
1) Make the persistent kernel allocate the hashmap in advance. 2) Introduce lightweight_uvector, which does not call any CUDA functions when not needed.
… the throughput when .pop is the bottleneck
Since the worker and job queues were decoupled, it's not necessary to wait for the job to be read anymore. As soon as the descriptor handle is read, it can be returned to the queue.
achirkin added the feature request (New feature or request), non-breaking (Non-breaking change), and 2 - In Progress (Currently a work in progress) labels on May 14, 2024
…the 'persistent' search parameter
Labels: 2 - In Progress (Currently a work in progress), 5 - DO NOT MERGE (Hold off on merging; see PR for details), cpp, feature request (New feature or request), non-breaking (Non-breaking change)
An experimental version of the single-cta CAGRA kernel that runs persistently while allowing many CPU threads to submit queries in small batches very efficiently.
API
In the current implementation, the public API does not change. An extra parameter `persistent` is added to `ann::cagra::search_params` (only valid when `algo == SINGLE_CTA`).

The persistent kernel is managed by a global runner object in a `shared_ptr`; the first CPU thread to call the kernel spawns the runner, while subsequent calls/threads only update a global "heartbeat" atomic variable (`runner_base_t::last_touch`). When there's no heartbeat within the last few seconds (`kLiveInterval`), the runner shuts down the kernel and cleans up the associated resources.

An alternative solution would be to control the kernel explicitly, in a client-server style. This would be more controllable, but would require significant re-thinking of the RAFT/cuVS API.
Integration notes
lightweight_uvector
RMM memory resources and device buffers are not zero-cost, even when the allocation size is zero (a common pattern for conditionally-used buffers). They make at least a couple of `cudaGetDevice` calls during initialization. Normally, this overhead is negligible. However, when the number of concurrent threads is large (hundreds of threads), any CUDA call can become a bottleneck due to a single mutex guarding a critical section somewhere in the driver.

To work around this, I introduce a `lightweight_uvector` in `/detail/cagra/search_plan.cuh` for several buffers used in CAGRA kernels. This is a stripped-down, "multi-device-unsafe" version of `rmm::uvector`: it does not check during resize/destruction whether the current device has changed since construction.

We may consider putting this in a common folder to use across other RAFT/cuVS algorithms.
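A host-only sketch of the idea follows; `std::malloc`/`std::free` stand in for the CUDA allocation calls, and the interface is a guess at the spirit of the class rather than its actual definition. The key property: constructing or resizing to zero performs no allocation call at all, and no code path queries the current device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

template <typename T>
class lightweight_uvector {
  T* data_          = nullptr;
  std::size_t size_ = 0;

 public:
  lightweight_uvector() = default;  // no CUDA (here: no allocator) call
  lightweight_uvector(const lightweight_uvector&)            = delete;
  lightweight_uvector& operator=(const lightweight_uvector&) = delete;
  ~lightweight_uvector() {
    if (data_ != nullptr) { std::free(data_); }
  }

  void resize(std::size_t n) {
    if (n == size_) { return; }
    if (data_ != nullptr) { std::free(data_); data_ = nullptr; }
    // Real code: allocate on the stream captured at construction, without
    // re-checking the current device -- hence "multi-device-unsafe".
    if (n != 0) { data_ = static_cast<T*>(std::malloc(n * sizeof(T))); }
    size_ = n;
  }
  T* data() noexcept { return data_; }
  std::size_t size() const noexcept { return size_; }
};
```

Skipping the device check is exactly the trade-off named above: each avoided `cudaGetDevice` is one less trip through the driver's mutex when hundreds of threads construct buffers concurrently.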
Shared resource queues / ring buffers
`resource_queue_t` is an atomic counter-based ring buffer used to distribute the worker resources (CTAs) and pre-allocated job descriptors across CPU I/O threads.

We may consider putting this in a common public namespace in RAFT if we envision more uses for it.
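An atomic counter-based ring buffer in this spirit can be sketched as follows; the member names and the zero-as-empty convention are assumptions, not the PR's actual layout. Producers and consumers claim slots with `fetch_add` on monotonically growing counters, and a power-of-two capacity lets the counters wrap via bit-masking.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>

// Handles must be non-zero in this sketch, since zero marks an empty slot.
template <typename T, uint32_t Capacity>
struct resource_queue_t {
  static_assert((Capacity & (Capacity - 1)) == 0, "capacity must be a power of two");
  std::array<std::atomic<T>, Capacity> slots{};
  // Padded counters keep producers and consumers off each other's cache lines.
  alignas(128) std::atomic<uint32_t> head{0};  // next slot to publish into
  alignas(128) std::atomic<uint32_t> tail{0};  // next slot to consume from

  void push(T handle) {
    auto i = head.fetch_add(1, std::memory_order_relaxed) & (Capacity - 1);
    slots[i].store(handle, std::memory_order_release);
  }
  T pop() {
    auto i = tail.fetch_add(1, std::memory_order_relaxed) & (Capacity - 1);
    T handle;
    // Spin until a producer has published a handle into this slot, then take it.
    while ((handle = slots[i].exchange(T{}, std::memory_order_acquire)) == T{}) {}
    return handle;
  }
};
```

This shape fits the use case above: the set of worker CTAs and job descriptors is fixed and pre-allocated, so the queue only ever recirculates a bounded set of non-zero handles and never needs to grow.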
Persistent runner structs
`launcher_t` and `persistent_runner_base_t` look like they could be abstracted from the CAGRA kernel and re-used in other algos. The code in its current state, however, is not ready for this.

Other code changes (solved)
This depends on (includes all the changes from):