
DGS-4172 Bound size of Avro datumReader/Writer caches #2331

Merged: 1 commit merged into confluentinc:6.1.x on Jul 12, 2022

Conversation

rayokota
Member

No description provided.


@tnn tnn left a comment


@rayokota Thanks for picking this up so fast! Did you consider porting the tests from #2330? I know the one I wrote is not perfect, but some kind of verification would be preferable.

@rayokota rayokota merged commit ed96278 into confluentinc:6.1.x Jul 12, 2022
@rayokota rayokota deleted the DGS-4172 branch July 12, 2022 21:43
@tnn

tnn commented Jul 12, 2022

I tested the cache implementation on the latest master / 7.3.0, and my test shows that AbstractKafkaAvroSerializer.datumWriterCache still contains duplicate entries. The code I used to test is here: master...tnn:schema-registry:datumWriterCache-bug-guava-cache
Namely, IdentityPair is only used in AbstractKafkaAvroDeserializer, not in AbstractKafkaAvroSerializer.
How about adding tests? Tests were also requested, and ignored, on the original PR when this change was introduced in May.

@rayokota
Member Author

rayokota commented Jul 12, 2022

Hi @tnn. The cache for datumReader/Writer was introduced in CP 6.1.0 by a community contribution; before that there was no cache. However, what we found is that using the cache with large schemas actually hurts performance, because the equals and hashCode methods for large schemas may be expensive. That is why we moved to identity (==) comparisons.

With identity comparisons, the cache may contain two entries that return true for equals, but it won't contain two entries that return true for ==. In some cases (as it appears with your use case), the cache may not be hit if schemas are true for equals but false for ==. In that case the behavior is similar to before the cache was introduced (I believe you are upgrading from 2.x.x): the cache will contain multiple entries that appear to be "duplicates", because they return true for equals (but false for ==).

But since the cache is now bounded to a default size of 1000, you will not get an OOM. The way the bound works is that older entries are evicted once the cache approaches the bound.
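As a minimal sketch of the behavior described above (not the actual AbstractKafkaAvroSerializer code; `IdentityKey` is a hypothetical stand-in for the PR's IdentityPair), structurally equal schemas produce separate entries in an identity-keyed cache, but the entry count stays bounded:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.avro.Schema;

public class IdentityCacheDemo {

  // Hypothetical stand-in for IdentityPair: equality by reference only.
  static final class IdentityKey {
    private final Schema schema;
    IdentityKey(Schema schema) { this.schema = schema; }

    @Override public boolean equals(Object o) {
      // Cheap reference comparison; avoids Schema.equals/hashCode on large schemas.
      return o instanceof IdentityKey && ((IdentityKey) o).schema == this.schema;
    }
    @Override public int hashCode() { return System.identityHashCode(schema); }
  }

  public static void main(String[] args) throws Exception {
    String def = "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
        + "[{\"name\":\"f\",\"type\":\"string\"}]}";
    // Two Parser instances yield distinct Schema objects: equals() true, == false.
    Schema a = new Schema.Parser().parse(def);
    Schema b = new Schema.Parser().parse(def);

    Cache<IdentityKey, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(1000)  // bounded, as this PR does for the datum caches
        .build();
    cache.get(new IdentityKey(a), Object::new);
    cache.get(new IdentityKey(b), Object::new); // miss: a second, "duplicate" entry

    System.out.println(a.equals(b));  // true
    System.out.println(a == b);       // false
    System.out.println(cache.size()); // 2, but growth is capped at maximumSize
  }
}
```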

@tnn

tnn commented Jul 14, 2022

large schemas actually hurts performance.

I follow that using the schema itself as the cache key is suboptimal; generally that's also not how a cache key should be chosen. A suitable cache key can be constructed with a secure hash algorithm like SHA-256, which can digest roughly 3 GiB/s per core in Java on a modern Intel processor, so it should not be a performance problem.
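As a hedged sketch of that fingerprint-based alternative (an assumption about the suggestion, not what this PR implements), a fixed-size key can be derived by hashing the schema's JSON text, so structurally equal schemas always map to the same entry:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import org.apache.avro.Schema;

public class FingerprintKeyDemo {

  // Illustrative key type: a SHA-256 digest of the schema's JSON text.
  static final class SchemaFingerprint {
    private final byte[] digest;

    SchemaFingerprint(Schema schema) throws Exception {
      MessageDigest sha = MessageDigest.getInstance("SHA-256");
      this.digest = sha.digest(schema.toString().getBytes(StandardCharsets.UTF_8));
    }

    @Override public boolean equals(Object o) {
      return o instanceof SchemaFingerprint
          && Arrays.equals(digest, ((SchemaFingerprint) o).digest);
    }
    @Override public int hashCode() { return Arrays.hashCode(digest); }
  }

  public static void main(String[] args) throws Exception {
    String def = "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
        + "[{\"name\":\"f\",\"type\":\"string\"}]}";
    // Distinct Schema objects, identical fingerprints -> the same cache entry.
    Schema a = new Schema.Parser().parse(def);
    Schema b = new Schema.Parser().parse(def);
    System.out.println(new SchemaFingerprint(a).equals(new SchemaFingerprint(b))); // true
  }
}
```

Avro's SchemaNormalization class offers parsing-canonical-form fingerprints that serve the same purpose while ignoring cosmetic differences in the JSON text.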

If concerned about performance, may I suggest not creating a new AvroSchema instance for every record in KafkaAvroSerializer#serialize?
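A hypothetical sketch of that suggestion (AvroSchemaInterner is an illustrative name, not existing project code, and it assumes AvroSchema can be constructed from an org.apache.avro.Schema): memoize the wrapper per Schema object with the same kind of bounded cache this PR uses, instead of allocating a new wrapper per record:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import java.util.concurrent.ExecutionException;
import org.apache.avro.Schema;

// Illustrative class only; this does not exist in the project.
final class AvroSchemaInterner {

  // weakKeys() compares keys by identity (==), which is cheap for large schemas;
  // maximumSize bounds the cache the same way this PR bounds the datum caches.
  private final Cache<Schema, AvroSchema> wrappers = CacheBuilder.newBuilder()
      .weakKeys()
      .maximumSize(1000)
      .build();

  AvroSchema wrapperFor(Schema schema) throws ExecutionException {
    // Reuse one wrapper per Schema object instead of allocating one per record.
    return wrappers.get(schema, () -> new AvroSchema(schema));
  }
}
```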

moved to identity (==)

Thanks for the recap of equality in Java. :)

(as it appears with your use case)

From my findings, the cache miss occurs in any case where the Schema.Parser is not reused between records, which appears to be the common pattern with generated code.
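For completeness, a small sketch of the workaround implied here (an assumption on my part, not official guidance): parse the schema once and reuse the same Schema instance for every record, so identity-keyed lookups hit instead of accumulating equal-but-distinct entries:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ReuseParsedSchema {

  // Parse once for the lifetime of the producer; every record then refers to
  // the same Schema object, so identity (==) comparisons hold across records.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
          + "[{\"name\":\"f\",\"type\":\"string\"}]}");

  static GenericRecord buildRecord(String value) {
    GenericRecord r = new GenericData.Record(SCHEMA);
    r.put("f", value);
    return r;
  }
}
```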

I've made a new branch with regression tests that verify the cache's efficiency for equality and identity, and that it no longer causes an OOM: master...tnn:datumWriterCache-bug-take2

@bhuangecl

If concerned about performance, may I suggest not creating a new AvroSchema instance for every record in KafkaAvroSerializer#serialize?

Hi @rayokota, we have run into the issue @tnn reported with the creation of AvroSchema instances, and it has a significant impact on performance in our case. Is there currently any plan to revisit this?
