Enhancement Proposal: Simplifying Token Calculation for High-Frequency Append-Log Style Operations #974
Comments
I'm reluctant to expose such functions publicly - I'd rather think about a solution that avoids the allocation while preserving type safety. To decrease the impact for now, you could use

By the way, did you benchmark the performance impact of batches? Does it actually improve performance compared to executing the queries in parallel?
You are talking about
One more thing: does your proposed change actually let you do what you want? In order to use
@Lorak-mmk yes: what I do is serialize my rows first, and I want to send them through `calculate_token_untyped`.
In our system, we must sustain a throughput of 125 MB/s while keeping latency low. We have found that delays introduced by ScyllaDB rerouting have a significant impact on our overall latency. To address this and optimize for efficiency, we are exploring parallel batch processing: our strategy assigns dedicated threads to handle batch inserts for specific nodes. By segmenting the workload in this manner, we aim to minimize the latency impact of rerouting and ensure consistent performance across the system.

In the ScyllaDB protocol, each batch frame can carry up to 256 MB of data and supports LZ4 compression. Batching related data together reduces its entropy, which improves compression ratios with algorithms such as LZ4.
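The per-node batching strategy described above can be sketched in a few lines. This is a minimal, self-contained illustration, not the driver's API: `node_for_key` is a hypothetical stand-in hash in place of real token-aware routing, and rows are plain byte vectors rather than the driver's serialized types.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for token computation: hash the serialized partition key.
/// The real driver would use the Murmur3 partitioner; this is only a sketch.
fn node_for_key(key: &[u8], node_count: usize) -> usize {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    (h.finish() as usize) % node_count
}

/// Group pre-serialized (key, row) pairs into one pending batch per node,
/// so each dedicated sender thread can flush its own batch every interval.
fn group_by_node(
    rows: Vec<(Vec<u8>, Vec<u8>)>,
    node_count: usize,
) -> HashMap<usize, Vec<(Vec<u8>, Vec<u8>)>> {
    let mut batches: HashMap<usize, Vec<(Vec<u8>, Vec<u8>)>> = HashMap::new();
    for (key, row) in rows {
        let node = node_for_key(&key, node_count);
        batches.entry(node).or_default().push((key, row));
    }
    batches
}

fn main() {
    let rows = vec![
        (b"sensor-1".to_vec(), b"reading-a".to_vec()),
        (b"sensor-1".to_vec(), b"reading-b".to_vec()),
        (b"sensor-2".to_vec(), b"reading-c".to_vec()),
    ];
    let batches = group_by_node(rows, 3);
    // Rows sharing a partition key always land in the same per-node batch.
    let total: usize = batches.values().map(|b| b.len()).sum();
    println!("{total} rows across {} batches", batches.len());
}
```

Grouping by owning node before sending also helps the compression argument above: related rows share structure, so each per-node batch compresses better.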
I tried to use this function; however, it requires a serialized partition key, which is only accessible through a
I have to go through a
nit: LZ4 does not have entropy encoding.
I meant that because the data has less entropy, the compression ratio will be higher. I was not referring to entropy encoding within LZ4.
It looks like #738 may be a more direct API than #975 for what you're trying to achieve. |
Enhancement Proposal: Simplifying Token Calculation for High-Frequency Append-Log Style Operations
Overview:
In our current project using ScyllaDB, we are implementing a high-frequency, append-log style architecture to handle concurrent append requests. To optimize performance and minimize network traffic, we batch these requests, similar to how the Kafka API operates, sending batches to ScyllaDB every 10 milliseconds.
To ensure efficient batching and minimize network overhead, it's crucial to group insert requests that will ultimately end up on the same node within ScyllaDB. This necessitates the computation of tokens for each insert statement, enabling us to determine their placement within the token ring.
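To make the routing step concrete: on a Murmur3-style token ring, each node owns a contiguous range of tokens, and a row's token maps to the first node token greater than or equal to it, wrapping around at the end of the ring. The sketch below is illustrative only; the ring layout and node names are hypothetical, not the driver's cluster-metadata API.

```rust
/// Each node owns the range up to (and including) its token.
/// A row token maps to the first node token >= it, wrapping around.
/// `ring` must be sorted by token, ascending.
fn owning_node(ring: &[(i64, &str)], token: i64) -> String {
    match ring.iter().find(|(t, _)| token <= *t) {
        Some((_, node)) => node.to_string(),
        None => ring[0].1.to_string(), // past the last token: wrap to the first node
    }
}

fn main() {
    let ring = [(-3000i64, "node-a"), (0, "node-b"), (3000, "node-c")];
    println!("{}", owning_node(&ring, -5000)); // owned by node-a
    println!("{}", owning_node(&ring, 100));   // owned by node-c
    println!("{}", owning_node(&ring, 9999));  // wraps around to node-a
}
```

Once the token of each insert is known, this lookup is all that is needed to assign the insert to the correct per-node batch.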
Current Challenge:
Presently, the existing API poses challenges in efficiently computing the token of a `PreparedStatement` without incurring significant performance overhead. The process involves invoking `Session::calculate_token`, which necessitates serializing a row (resulting in memory allocation), extracting the partition key, and then computing the token. Subsequently, when batching these statements using `Session::batch`, each row undergoes serialization again, effectively doubling memory allocation and serialization overhead.

Immediate Solution
To streamline this process and enhance performance, we propose making `Session::calculate_token_untyped` public instead of keeping it `pub(crate)`. By exposing this method publicly, we can pre-serialize every row, thereby reusing the serialization results to compute tokens and seamlessly integrate them into our batching process.
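The pattern the proposal enables is "serialize once, use the bytes twice." The following sketch shows only the control flow; `serialize_row` and `token_from_bytes` are hypothetical stand-ins for the driver's serialization types and for `calculate_token_untyped`, not its real signatures.

```rust
/// Stand-in wire format for a serialized row: length-prefixed fields.
/// A real driver would produce its own serialized-values representation.
fn serialize_row(pk: &str, value: &str) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend((pk.len() as u32).to_be_bytes());
    out.extend(pk.as_bytes());
    out.extend((value.len() as u32).to_be_bytes());
    out.extend(value.as_bytes());
    out
}

/// Stand-in for `calculate_token_untyped`: derive a token directly from
/// already-serialized bytes, with no second serialization pass.
fn token_from_bytes(serialized: &[u8]) -> i64 {
    serialized
        .iter()
        .fold(0i64, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as i64))
}

fn main() {
    // Serialize exactly once...
    let bytes = serialize_row("device-42", "payload");
    // ...then reuse the same bytes for both routing and batching.
    let token = token_from_bytes(&bytes);
    let batch: Vec<Vec<u8>> = vec![bytes]; // stands in for a per-node batch
    println!("token {token}, batch of {} pre-serialized rows", batch.len());
}
```

The point is that the row bytes are produced once and shared between token computation and the batch payload, which is exactly the double-allocation the current `calculate_token` + `batch` path cannot avoid.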
Additional Note
In addition to the proposed enhancement of making `Session::calculate_token_untyped` public, we suggest making `PartitionHasher` publicly accessible as well. This would empower users to compute results in advance without having to go through the serialization process of `SerializeRow` and `PreparedStatement`.

Considering that many ScyllaDB use cases involve key-value stores where the partition key is often known early on, exposing `PartitionHasher` would facilitate more efficient pre-computation of tokens, enhancing overall performance and developer experience.