
DGS-4172 Bound size of Avro datumReader/Writer caches #2331

Merged: 1 commit merged into confluentinc:6.1.x on Jul 12, 2022

Conversation

rayokota
Member

No description provided.


@tnn tnn left a comment


@rayokota Thanks for picking this up so fast! Did you consider porting the tests from #2330? I know the one I wrote is not perfect, but some kind of verification would be preferable.

@rayokota rayokota merged commit ed96278 into confluentinc:6.1.x Jul 12, 2022
@rayokota rayokota deleted the DGS-4172 branch July 12, 2022 21:43
@tnn

tnn commented Jul 12, 2022

I tested the cache implementation on the latest master / 7.3.0, and my test shows that AbstractKafkaAvroSerializer.datumWriterCache still contains duplicate entries. The code I used to test is here: master...tnn:schema-registry:datumWriterCache-bug-guava-cache
Namely, IdentityPair is only used in AbstractKafkaAvroDeserializer, not in AbstractKafkaAvroSerializer.
How about adding tests? Tests were also requested, and ignored, on the original PR when this change was introduced in May.

@rayokota
Member Author

rayokota commented Jul 12, 2022

Hi @tnn. The cache for datumReader/Writer was introduced in CP 6.1.0 by a community contribution; before that there was no cache. However, what we found is that using the cache with large schemas actually hurts performance, because the equals and hashCode methods for large schemas may be expensive. That is why we moved to identity (==) comparisons.

With identity comparisons, the cache may contain two entries that return true for equals, but it won't contain two entries that return true for ==. In some cases (as it appears with your use case), the cache may not be hit if schemas are true for equals but false for ==. In that case the behavior is similar to before the cache was introduced (I believe you are upgrading from 2.x.x): the cache will contain multiple entries that appear to be "duplicates", because they return true for equals (but false for ==).

But since the cache is now bounded to a default size of 1000, you will not get an OOM. The way the bound works is that older entries are evicted once the cache approaches the bound.
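As a minimal sketch of the behavior described above (not the actual AbstractKafkaAvroSerializer code; `IdentityKey` is a hypothetical stand-in for the PR's IdentityPair), structurally equal schemas produce separate entries in an identity-keyed cache, but the entry count stays bounded:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import org.apache.avro.Schema;

public class IdentityCacheDemo {

  // Hypothetical stand-in for IdentityPair: equality by reference only.
  static final class IdentityKey {
    private final Schema schema;
    IdentityKey(Schema schema) { this.schema = schema; }

    @Override public boolean equals(Object o) {
      // Cheap reference comparison; avoids Schema.equals/hashCode on large schemas.
      return o instanceof IdentityKey && ((IdentityKey) o).schema == this.schema;
    }
    @Override public int hashCode() { return System.identityHashCode(schema); }
  }

  public static void main(String[] args) throws Exception {
    String def = "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
        + "[{\"name\":\"f\",\"type\":\"string\"}]}";
    // Two Parser instances yield distinct Schema objects: equals() true, == false.
    Schema a = new Schema.Parser().parse(def);
    Schema b = new Schema.Parser().parse(def);

    Cache<IdentityKey, Object> cache = CacheBuilder.newBuilder()
        .maximumSize(1000)  // bounded, as this PR does for the datum caches
        .build();
    cache.get(new IdentityKey(a), Object::new);
    cache.get(new IdentityKey(b), Object::new); // miss: a second, "duplicate" entry

    System.out.println(a.equals(b));  // true
    System.out.println(a == b);       // false
    System.out.println(cache.size()); // 2, but growth is capped at maximumSize
  }
}
```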

@tnn

tnn commented Jul 14, 2022

large schemas actually hurts performance.

I follow that using the schema itself as the cache key is suboptimal; generally that's also not how a cache key should be chosen. A suitable cache key can be constructed with a secure hash algorithm like SHA-256, which can digest roughly 3 GiB/s per core in Java on a modern Intel processor, so it should not be a performance problem.
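As a hedged sketch of that fingerprint-based alternative (an assumption about the suggestion, not what this PR implements), a fixed-size key can be derived by hashing the schema's JSON text, so structurally equal schemas always map to the same entry:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import org.apache.avro.Schema;

public class FingerprintKeyDemo {

  // Illustrative key type: a SHA-256 digest of the schema's JSON text.
  static final class SchemaFingerprint {
    private final byte[] digest;

    SchemaFingerprint(Schema schema) throws Exception {
      MessageDigest sha = MessageDigest.getInstance("SHA-256");
      this.digest = sha.digest(schema.toString().getBytes(StandardCharsets.UTF_8));
    }

    @Override public boolean equals(Object o) {
      return o instanceof SchemaFingerprint
          && Arrays.equals(digest, ((SchemaFingerprint) o).digest);
    }
    @Override public int hashCode() { return Arrays.hashCode(digest); }
  }

  public static void main(String[] args) throws Exception {
    String def = "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
        + "[{\"name\":\"f\",\"type\":\"string\"}]}";
    // Distinct Schema objects, identical fingerprints -> the same cache entry.
    Schema a = new Schema.Parser().parse(def);
    Schema b = new Schema.Parser().parse(def);
    System.out.println(new SchemaFingerprint(a).equals(new SchemaFingerprint(b))); // true
  }
}
```

Avro's SchemaNormalization class offers parsing-canonical-form fingerprints that serve the same purpose while ignoring cosmetic differences in the JSON text.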

If concerned about performance, may I suggest not creating a new AvroSchema instance for every record in KafkaAvroSerializer#serialize?
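A hypothetical sketch of that suggestion (AvroSchemaInterner is an illustrative name, not existing project code, and it assumes AvroSchema can be constructed from an org.apache.avro.Schema): memoize the wrapper per Schema object with the same kind of bounded cache this PR uses, instead of allocating a new wrapper per record:

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import java.util.concurrent.ExecutionException;
import org.apache.avro.Schema;

// Illustrative class only; this does not exist in the project.
final class AvroSchemaInterner {

  // weakKeys() compares keys by identity (==), which is cheap for large schemas;
  // maximumSize bounds the cache the same way this PR bounds the datum caches.
  private final Cache<Schema, AvroSchema> wrappers = CacheBuilder.newBuilder()
      .weakKeys()
      .maximumSize(1000)
      .build();

  AvroSchema wrapperFor(Schema schema) throws ExecutionException {
    // Reuse one wrapper per Schema object instead of allocating one per record.
    return wrappers.get(schema, () -> new AvroSchema(schema));
  }
}
```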

moved to identity (==)

Thanks for the recap of equality in Java. :)

(as it appears with your use case)

From my findings, the cache miss occurs in any case where the Schema.Parser is not reused between records, which appears to be the common pattern with generated code.
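For completeness, a small sketch of the workaround implied here (an assumption on my part, not official guidance): parse the schema once and reuse the same Schema instance for every record, so identity-keyed lookups hit instead of accumulating equal-but-distinct entries:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class ReuseParsedSchema {

  // Parse once for the lifetime of the producer; every record then refers to
  // the same Schema object, so identity (==) comparisons hold across records.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"R\",\"fields\":"
          + "[{\"name\":\"f\",\"type\":\"string\"}]}");

  static GenericRecord buildRecord(String value) {
    GenericRecord r = new GenericData.Record(SCHEMA);
    r.put("f", value);
    return r;
  }
}
```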

I've made a new branch with regression tests that verify the cache's efficiency for equality and identity, and that it no longer causes an OOM: master...tnn:datumWriterCache-bug-take2

@bhuangecl

If concerned about performance, may I suggest not creating a new AvroSchema instance for every record in KafkaAvroSerializer#serialize?

Hi @rayokota, we have run into the issue @tnn reported with the creation of AvroSchema instances, and it has a significant impact on performance in our case. Is there currently any plan to revisit this?
