Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent GC data size metrics with Shenandoah and ZGC #2872

Closed
SimSonic opened this issue Nov 23, 2021 · 15 comments · Fixed by #2875
Closed

Intermittent GC data size metrics with Shenandoah and ZGC #2872

SimSonic opened this issue Nov 23, 2021 · 15 comments · Fixed by #2875
Labels
bug A general bug regression A regression from a previous release
Milestone

Comments

@SimSonic
Copy link

SimSonic commented Nov 23, 2021

I'm using Spring Boot (2.5.x) with Java 17 (Liberica JDK alpine-musl) and Shenandoan GC.

For a bunch of latest Spring Boot versions some GC's metrics are sometimes going to zero:

  • jvm_gc_live_data_size_bytes
  • jvm_gc_max_data_size_bytes

image

Previously (I don't remember exactly, but possibly on 2.4.x branch) — this graph was also continuous.

This is not case of failed prometheus scrapes, all other graphs are continuous.

It is looking weird now so I believe there is a bug somewhere.

Thanks.

@checketts
Copy link
Contributor

Does downgrading micrometer or the JVM fix it? (Just to ensure that is the link)

Does your chart 'null as zero' set or you are actually seeing zero?

@SimSonic
Copy link
Author

I'm actually seeing zero.

It started on JDK 16 too. I don't tried to downgrade micrometer, the version I use is from spring boot's dependencies.

I just can remember that on Spring Boot 2.4.x there where no drops to zero.

@jonatan-ivanov
Copy link
Member

I don't think Spring Boot has anything to do with this, my guess would be that Micrometer gets zero from JMX.
Could you please help us investigating this by doing two things?

  1. Could you please check a non-alpine image that has glibc (I think this worth a try but I'm not expecting that the behavior will be different)?
  2. Could you please create a @Scheduled method and call the same method that Micrometer does (see: JvmGcMetrics) and see if that gives you zero?

@jonatan-ivanov jonatan-ivanov added the waiting for feedback We need additional information before we can continue label Nov 24, 2021
@SimSonic
Copy link
Author

@jonatan-ivanov 👌 I'll do it on this week!

@shakuzen
Copy link
Member

Could you please create a @Scheduled method and call the same method that Micrometer does (see: JvmGcMetrics) and see if that gives you zero?

Both the live and max size are controlled by an AtomicLong that is set in the logic inside the GC notification listener. I think it will be hard to do the same thing without duplicating the notification listener code from JvmGcMetrics. At that point, it's probably easiest to just periodically log the live and max gauge values to confirm they are reporting as zero at some point in a reproducible way.

If you can provide a minimal sample project that we can run to see this behavior, it will help us investigate.

@SimSonic
Copy link
Author

@shakuzen I was able to reproduce it on Windows right in IDEA:
issue2872.zip

image

VM Options to start application are -XX:+UseShenandoahGC -Xmx100M.

@shakuzen
Copy link
Member

How reliably are you able to reproduce that result? How long is the application running before you see it? I've tried running the application on my MacOS and Windows machine for about 30 minutes each and I don't see that. I'm seeing the value change but not with a 0 in between.

@SimSonic
Copy link
Author

SimSonic commented Nov 24, 2021

@shakuzen Did you set Xmx to force GC working? Maybe you need tune allocation rate for your machine...

On my PC it run for less than 1 minute until I saw zero.

What is your JDK vendor? I'm using Bellsoft Liberica 17 Full JDK on Windows and bellsoft/liberica-openjdk-alpine-musl:17 for staging/production servers.

@shakuzen shakuzen added regression A regression from a previous release and removed waiting for feedback We need additional information before we can continue labels Nov 24, 2021
@shakuzen shakuzen added this to the 1.7.x milestone Nov 24, 2021
@shakuzen
Copy link
Member

I'm using the Liberica distribution of 17 on MacOS and Liberica's 17.0.1 downloaded via IntelliJ on Windows. I'm using the VM options you mentioned -XX:+UseShenandoahGC -Xmx100M when running in both places.

After trying some more, I was able to reproduce it. It's still not very easy for me to reproduce, which makes it time consuming to investigate. I think can see what the issue is now.

We process GarbageCollectionNotificationInfo objects using a listener for recording the gc metrics. When the gc notification info object has the name Shenandoah Pauses, which are produced during phases of Shenandoah collection that pause, the memory usage before/after collection are all 0. I'm seeing that only GCNotificationInfo for Shenandoah Cycles which correspond to concurrent phases of Shenandoah collection, the memory usage before/after are available.

I'll have to take a closer look tomorrow and write up tests, but it seems this probably doesn't happen in Micrometer 1.6.x and was introduced as a regression in 1.7.0 with #2560.

@shakuzen shakuzen changed the title Intermittent GC data size metrics with Shenandoah GC Intermittent GC data size metrics with non-generational GC Nov 25, 2021
@shakuzen shakuzen linked a pull request Nov 26, 2021 that will close this issue
@shakuzen shakuzen changed the title Intermittent GC data size metrics with non-generational GC Intermittent GC data size metrics with Shenandoah and ZGC Nov 26, 2021
@shakuzen shakuzen modified the milestones: 1.7.x, 1.7.7 Nov 26, 2021
@jonatan-ivanov
Copy link
Member

fixed by #2875

@shakuzen
Copy link
Member

shakuzen commented Dec 8, 2021

@SimSonic if you could try out 1.7.7 or 1.8.1 and let us know if this is fixed for you, that'd be great.

@SimSonic
Copy link
Author

SimSonic commented Dec 8, 2021

@shakuzen yes, can check in two days.

@SimSonic
Copy link
Author

SimSonic commented Dec 9, 2021

@shakuzen Looking okay for about 4+ hours. This is a qa stage so have no production load. Keeping for a night.
image

@SimSonic
Copy link
Author

SimSonic commented Dec 10, 2021

Confirms absence of zeroes for 17.5 hours.

Thank you for your time!

@shakuzen
Copy link
Member

Thank you for letting us know about the issue and confirming the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A general bug regression A regression from a previous release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants