
KAFKA-16770; [1/N] Coalesce records into bigger batches #15964

Merged
merged 3 commits into apache:trunk from the KAFKA-16770 branch on May 21, 2024

Conversation

dajac
Contributor

@dajac dajac commented May 15, 2024

We have discovered during large-scale performance tests that the current write path of the new coordinator does not scale well. The issue is that each write operation writes synchronously from the coordinator threads. Coalescing records into bigger batches helps drastically because it amortizes the cost of writes. Aligning the batches with the snapshots of the timeline data structures also reduces the number of in-flight snapshots.

This patch is the first of a series of patches that will bring record coalescing into the coordinator runtime. As a first step, we had to rework the PartitionWriter interface and move the logic that builds MemoryRecords out of it and into the CoordinatorRuntime. The main changes are in these two classes; the others are related mechanical changes.
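
To make the shape of the change concrete, here is a simplified sketch of the PartitionWriter change (based on the javadoc links touched in this patch; not the exact signatures, which also carry generics and further parameters):

// Before (sketch): the writer received the coordinator records and built the batch itself.
long append(TopicPartition tp, List<T> records);

// After (sketch): the CoordinatorRuntime builds the MemoryRecords and the writer only appends them,
// together with the verification guard used for transactional writes.
long append(TopicPartition tp, VerificationGuard verificationGuard, MemoryRecords records);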

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@dajac dajac added the KIP-848 label May 15, 2024
@dajac dajac requested a review from jolshan May 15, 2024 14:48
@@ -1072,8 +1174,8 @@ public void replay(
@Test
public void testScheduleWriteOpWhenWriteFails() {
MockTimer timer = new MockTimer();
// The partition writer only accept on write.
MockPartitionWriter writer = new MockPartitionWriter(2);
// The partition writer only accept one write.
Contributor

@jolshan jolshan May 16, 2024

For my understanding: we always batched the (in this case 2) records that were part of the same write operation. For now we aren't changing this, but we are moving the logic to the coordinator runtime to make space for the batching logic as a follow-up?

Contributor Author

You got it right. A write operation produces a single batch with all the records it generates. This patch does not change that; it only changes where the MemoryRecords are built. The next patch will add the logic to keep the batch open until it is full or until a linger time is reached. With this, records produced by many write operations will end up in the same batch.
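
To illustrate the follow-up behaviour described here, a rough sketch (hypothetical names, not code from this or the next patch) of keeping a batch open until it is full or the linger time expires:

// Hypothetical sketch: append records from a write operation into the current open batch.
void appendToCurrentBatch(List<Record> records, long nowMs) {
    if (currentBatch == null) {
        currentBatch = startNewBatch(nowMs); // also schedules a flush after lingerMs
    }
    for (Record record : records) {
        if (!currentBatch.hasRoomFor(record)) {
            flushCurrentBatch();             // write the full batch to the log
            currentBatch = startNewBatch(nowMs);
        }
        currentBatch.append(record);
    }
}

// Hypothetical sketch: flush a non-empty batch once the linger time is reached.
void onLingerExpired(long nowMs) {
    if (currentBatch != null) {
        flushCurrentBatch();
        currentBatch = null;
    }
}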

@jolshan
Contributor

jolshan commented May 16, 2024

I took a first pass to get a general understanding. I will come back tomorrow, take a deeper dive into some of the minor changes, and let you know if I think of anything that was missed.

result
VerificationGuard.SENTINEL,
MemoryRecords.withEndTransactionMarker(
time.milliseconds(),
Contributor

It seems we didn't specify this time value before. Was that a bug? I guess it also just gets the system time in the method.

Contributor Author

withEndTransactionMarker takes the current time if we don't specify it. The reason why I set it explicitly here is to ensure that the mock time is used in tests.
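
For illustration, a minimal sketch of the difference (assuming the usual overloads of MemoryRecords.withEndTransactionMarker; the producer id, producer epoch and coordinator epoch are placeholder values):

EndTransactionMarker marker = new EndTransactionMarker(ControlRecordType.COMMIT, coordinatorEpoch);

// Without an explicit timestamp the marker is stamped with the wall-clock system time.
MemoryRecords.withEndTransactionMarker(producerId, producerEpoch, marker);

// Passing time.milliseconds() from the injected Time (a MockTime in tests) makes the timestamp deterministic.
MemoryRecords.withEndTransactionMarker(time.milliseconds(), producerId, producerEpoch, marker);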

byte magic = logConfig.recordVersion().value;
int maxBatchSize = logConfig.maxMessageSize();
long currentTimeMs = time.milliseconds();
ByteBuffer buffer = context.bufferSupplier.get(Math.min(16384, maxBatchSize));
Contributor

Nice we got rid of the thread local. 👍

// coordinator is the single writer to the underlying partition so we can
// deduce it like this.
for (int i = 0; i < result.records().size(); i++) {
MemoryRecordsBuilder builder = MemoryRecords.builder(
Contributor

nit: is there a benefit from putting this here and not right before the append method?

Contributor Author

The builder is used in the above loop (L801) so we need it here.
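
For context, a simplified sketch of this code path: one MemoryRecordsBuilder is created per write operation, every record produced by the operation is appended to it in the loop, and the resulting single batch is handed to the writer. (serializedKeys/serializedValues stand in for the actual serializer calls, and the exact MemoryRecords.builder overloads vary across Kafka versions; this assumes the CompressionType-based one.)

ByteBuffer buffer = ByteBuffer.allocate(Math.min(16384, maxBatchSize)); // in the runtime this comes from bufferSupplier
MemoryRecordsBuilder builder = MemoryRecords.builder(
    buffer,
    magic,                      // derived from the log config, as above
    CompressionType.NONE,
    TimestampType.CREATE_TIME,
    0L                          // base offset
);
for (int i = 0; i < result.records().size(); i++) {
    // serialize each coordinator record and append it to the single batch
    builder.append(currentTimeMs, serializedKeys.get(i), serializedValues.get(i));
}
MemoryRecords memoryRecords = builder.build();
// the runtime then calls partitionWriter.append(tp, verificationGuard, memoryRecords)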


/**
* Listener allowing to listen to high watermark changes. This is meant
* to be used in conjunction with {{@link PartitionWriter#append(TopicPartition, List)}}.
* to be used in conjunction with {{@link PartitionWriter#append(TopicPartition, VerificationGuard, MemoryRecords)}}.
Contributor

Is there a programmatic way to check if these links are broken due to refactoring, or do you need to do it manually?

Just wondering if there is an easy way to check you did them all :)

Contributor Author

IntelliJ reports them as warnings. I suppose that we would also get warnings when we generate the javadoc.

*/
public class InMemoryPartitionWriter<T> implements PartitionWriter<T> {

public static class LogEntry {
Contributor

nice that we could just use the real memory records

@@ -84,98 +67,28 @@ class CoordinatorPartitionWriterTest {
}

@Test
def testWriteRecords(): Unit = {
Contributor

Do we have an equivalent test for the writing of the records in CoordinatorRuntimeTest? I didn't really notice new tests, but saw we have some of the builder logic there. Is it tested by checking equality between the records generated by the helper methods and the output from running the CoordinatorRuntime code?

Contributor Author

Right. We have many tests in CoordinatorRuntimeTest doing writes. As we fully validate the records now, they cover this.
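
Illustrative only (hypothetical helper and mock names): such a check boils down to comparing the MemoryRecords that the runtime handed to the writer against records built the same way in the test:

// Hypothetical sketch of the kind of assertion used in those tests.
MemoryRecords written = writer.entries(tp).get(0);                     // what the runtime passed to append()
MemoryRecords expected = buildExpectedRecords(currentTimeMs, records); // hypothetical helper building the same batch
assertEquals(expected, written);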

@dajac
Contributor Author

dajac commented May 17, 2024

@jolshan Thanks for your comments. I replied to them.

@dajac dajac merged commit b4c2d66 into apache:trunk May 21, 2024
1 check failed
@dajac dajac deleted the KAFKA-16770 branch May 21, 2024 06:47
rreddy-22 pushed a commit to rreddy-22/kafka-rreddy that referenced this pull request May 24, 2024

Reviewers: Justine Olshan <jolshan@confluent.io>