A fatal error has been detected by the Java Runtime Environment: SIGSEGV #370

Open
elshad-faire opened this issue Nov 16, 2022 · 29 comments

Comments

@elshad-faire

I am using grpc kotlin for a simple client and server application where multiple clients connect to the server to get tests to execute and report results back. Very rarely the server will crash with SIGSEGV. The hs_err_pid file seems to suggest that this is happening at "io.grpc.kotlin.ServerCalls$serverCallListener$requests$1::invokeSuspend".

I looked online, and most similar crashes seemed to have different causes. The report links to https://github.com/adoptium/adoptium-support/issues for filing reports, but I thought I would consult grpc-kotlin first since the issue seems to happen at "io.grpc.kotlin.ServerCalls$serverCallListener$requests$1::invokeSuspend".

Can anyone let me know what might be going on? Note that this happens partway through the server's run: 105 clients connect to the server, and the crash came after ~2000 tests had been scheduled.

Attaching hs_err_pid file:
hs_err_pid1248.log

Please let me know if there is a better place to report this.

@jamesward
Collaborator

It looks like the underlying error is:

java/lang/NoClassDefFoundError'{0x000000068be67e80}: org/slf4j/LoggerFactory> (0x000000068be67e80) 

So I think that there is some logic that is trying to log something and slf4j is not in the classpath. Does that seem right?

@elshad-faire
Author

I saw this error. But wouldn't I get a normal failure, since a NoClassDefFoundError is thrown? Would this cause a JVM crash?

@jamesward
Collaborator

Because there is likely some reflection underneath this, you won't see the error until a code path is followed that tries to reflectively find and use the class.
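
A minimal Kotlin sketch of that behaviour, assuming a hypothetical class name `com.example.DoesNotExist` that is not on the classpath: the reflective lookup only fails when the code path actually runs, and the failure is an ordinary exception that can be caught. The helper name `isOnClasspath` is made up for illustration and is not part of gRPC Kotlin or slf4j.

```kotlin
// Hypothetical helper: reports whether a class can be located by name at runtime.
fun isOnClasspath(className: String): Boolean =
    try {
        // initialize = false: only locate the class, do not run its static initializers.
        Class.forName(className, false, Thread.currentThread().contextClassLoader)
        true
    } catch (e: ClassNotFoundException) {
        false
    } catch (e: LinkageError) {
        // NoClassDefFoundError is a LinkageError.
        false
    }

fun main() {
    println(isOnClasspath("java.lang.String"))
    // Result depends on whether slf4j is on the classpath of this process.
    println(isOnClasspath("org.slf4j.LoggerFactory"))
    println(isOnClasspath("com.example.DoesNotExist"))
}
```

Nothing in this sketch touches native code, so a missing class on its own would normally surface as a stack trace rather than a SIGSEGV.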

@elshad-faire
Author

I see. I will add slf4j to the classpath then. It is indeed not on the classpath.
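
For a Gradle Kotlin DSL build, adding slf4j could look roughly like the fragment below. The version numbers and the choice of `slf4j-simple` as the binding are assumptions for illustration; use whatever matches the rest of the build. This only addresses the NoClassDefFoundError and may not affect the SIGSEGV.

```kotlin
// build.gradle.kts
dependencies {
    // Assumed version; align it with the other dependencies in your build.
    implementation("org.slf4j:slf4j-api:1.7.36")
    // Assumed binding; any slf4j backend (for example logback) would also do.
    runtimeOnly("org.slf4j:slf4j-simple:1.7.36")
}
```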

I will mark this closed then as well. If I get the same issue after adding slf4j, then I will reopen this.

Thanks for the help!

@elshad-faire
Author

elshad-faire commented Nov 24, 2022

I don't think the class not being found is the cause of this issue. Even with reflection, there would be a different error. If some code uses reflection to get a class and then uses it, the worst case would be a NullPointerException.

In all the cases I get, the "Compilation events" section contains a stack trace where the last call is io.grpc.kotlin.ServerCalls$serverCallListener$requests$1::invokeSuspend.

Is it possible that this is a bug in coroutine usage in grpc kotlin?

@jamesward
Collaborator

I think that usually when you see this kind of stacktrace it means that something in your code threw an unhandled exception and that was in a suspend function.

@elshad-faire
Author

What I am confused about is why it causes a SIGSEGV. Shouldn't I see a normal Java crash due to an unhandled exception?

Sorry, my understanding of how Kotlin suspend works is limited. Is suspend implemented in Java, or is it a C++ extension to Java?

@jamesward
Collaborator

Yeah, you are right. I think that one possible culprit is Netty as it uses a native transport by default and if that misbehaves you can get a SIGSEGV. Might be worth seeing if you can use a newer gRPC version as Netty is shaded (I think).

@elshad-faire
Author

I checked all the stack traces of the crashes so far. Most of them have some netty stuff in the stack trace. However, one of them does not. Not sure if this means that maybe the source of the problem is not netty.

Also can you elaborate on how netty being shaded might cause issues?

@jamesward
Collaborator

Netty being shaded isn't the source of the problem. It just means that to upgrade Netty you have to upgrade the library it is shaded in (in this case gRPC).

@Dontcampy

Dontcampy commented Dec 5, 2022

I also hit a similar issue in my service; it happened 3 times this month. Is there any way to solve it?
C2:27301382 14500 4 io.grpc.kotlin.ServerCalls$serverCallListener$requests$1::invoke (13 bytes)
The relevant gRPC dependencies:

  implementation("io.grpc:grpc-protobuf:1.47.0")
  implementation("io.grpc:grpc-stub:1.47.0")
  runtimeOnly("io.grpc:grpc-netty-shaded:1.47.0")
  implementation("io.grpc:grpc-kotlin-stub:1.3.0")

hs_err_1.log

@Dontcampy

After searching for "C2 CompilerThread0" errors in the OpenJDK bug tracker, I found a bug they fixed recently, https://bugs.openjdk.org/browse/JDK-8285835, but I'm not sure whether that case is similar to ours.

@jamesward
Collaborator

A JDK bug could explain why more people don't experience the problem. If anyone can bump their JDK to fix this, that'd be really useful data.
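
Until a different JDK can be tried, one way to test the C2 hypothesis is to keep the method named in the crash logs out of JIT compilation with HotSpot's `-XX:CompileCommand=exclude` option. The sketch below assumes the server is started through Gradle's `application` plugin. Nobody in this thread has confirmed that it avoids the crash, and the excluded method then runs interpreted, which may cost some throughput.

```kotlin
// build.gradle.kts (assumes the `application` plugin is applied)
application {
    applicationDefaultJvmArgs = listOf(
        // Ask HotSpot not to JIT-compile the method that appears in the crash logs.
        // "\$" stops Kotlin from treating the class-name parts as string templates.
        "-XX:CompileCommand=exclude,io.grpc.kotlin.ServerCalls\$serverCallListener\$requests\$1::invokeSuspend"
    )
}
```

One of the logs above names `invoke` rather than `invokeSuspend`, so the method part of the pattern may need adjusting to match what your own hs_err file shows.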

@elshad-faire
Author

I looked at https://bugs.openjdk.org/browse/JDK-8285835. Apparently the fix was https://git.openjdk.org/jdk/commit/8aa1526b443025b8606a3668262f46a9cb6ea6f6. The fix includes a Java file that serves as a regression test: running it should not SIGSEGV. I compiled that code and ran it many times with the flags mentioned in the file's comments, and I didn't get a SIGSEGV.

Does this mean that our issue is not the same?

@elshad-faire
Author

Also for my case, the jdk seems to be the latest. In the SIGSEGV report the version is listed as JRE version: OpenJDK Runtime Environment Temurin-17.0.5+8 (17.0.5+8) (build 17.0.5+8). And it matches the latest mentioned in https://adoptium.net/temurin/releases/.
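
The same version information can also be logged from inside the server process at startup, which makes it easier to compare crash reports across machines. A minimal sketch, assuming a JDK 10 or newer runtime (for `Runtime.Version.feature()`); the function name `runtimeDescription` is hypothetical.

```kotlin
// Hypothetical helper: summarises the JDK build this process is running on.
fun runtimeDescription(): String {
    val version = Runtime.version()
    // Vendor and VM properties may be absent on some builds, so fall back to "unknown".
    val vendorVersion = System.getProperty("java.vendor.version") ?: "unknown"
    val vmName = System.getProperty("java.vm.name") ?: "unknown"
    return "feature=${version.feature()} full=$version vendor=$vendorVersion vm=$vmName"
}

fun main() {
    // Print once at startup so the exact JDK build ends up in the server logs.
    println(runtimeDescription())
}
```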

@jamesward
Collaborator

It looks like the fix was merged into 19.0.2. Might be worth trying that JDK to see if it fixes it.

@elshad-faire
Author

Sorry if this is a noob question, but it mentions that the fix is backported to 17 as well. Doesn't that mean 17 should be OK too?

@jamesward
Collaborator

I'm not totally sure how to figure that out. The latest release of Temurin 17 was Oct 25:
https://adoptium.net/temurin/archive/

So seems kinda unlikely the fix is in there.

@elshad-faire
Author

Ok I see. I guess I could have checked that myself -_-

I will watch out for new releases and try one once it is out.

@elshad-faire
Author

Just wanted to get back to this issue. We are still seeing this issue sometimes.

Now I have another grpc server binary that will be run more often, so we get hit by this issue a lot more. It is blocking us from making progress with the project.

I tried the latest version of Temurin 17 (Temurin-17.0.6+10) and the same issue happened.

See the crash file:
hs_err_pid1848.log

@jamesward, do you have any ideas about what I can try? Do you know of a JDK version that does not have this issue? I can try 18 or 19 as well.

@jebbench

We've also been seeing this issue, I've raised tickets with Adoptium (adoptium/adoptium-support#721) and Kotlin (https://youtrack.jetbrains.com/issue/KT-56975).

We have been using Temurin 17 (Temurin-17.0.6+10) with Kotlin 1.7 and 1.8.

@elshad-faire
Author

So the jdk I was using in Docker was eclipse-temurin:17-jdk-focal. And I changed to eclipse-temurin:19-jdk-focal. And this seems to have fixed the issue for me.

On my end, I will see if I can just productionize the work I am doing with eclipse-temurin:19-jdk-focal.

But I still feel like this should work with eclipse-temurin:17-jdk-focal as well because not everyone can easily upgrade jdk version. So I will just keep the issue open.

@ghost

ghost commented Mar 1, 2023

Looks like the fix was merged into 17.0.7+4, which is not scheduled to be released until April 18 (according to https://wiki.openjdk.org/display/JDKUpdates/JDK+17u). That would explain why people on the latest 17 are still getting the issue while those on JDK 19 are not.

@elshad-faire
Author

@AdamMolnar4D41 thanks for the reply. I will keep using jdk19 for now and try out the new release of jdk17 when it is out. Then I will comment here if the issue is resolved with jdk17.

@jebbench

This is still occurring in OpenJDK 64-Bit Server VM Temurin-17.0.7+7.

@jamesward
Collaborator

Thanks @jebbench for the update. Not sure if there is anything we can do as it seems like an OpenJDK bug from the other conversations.

@elshad-faire
Author

@jamesward Shouldn't the grpc-kotlin team at least try to understand why the SIGSEGV happens and open a relevant bug against OpenJDK? So far it seems that we have only speculated about what the issue might be, and various releases of OpenJDK 17 were tried with no luck. In addition, in our environment we use many other libraries, and they do not cause a SIGSEGV.

@jamesward
Collaborator

There are already a few bugs reported:
adoptium/adoptium-support#721
adoptium/adoptium-support#659
https://youtrack.jetbrains.com/issue/KT-56975/JVM-A-fatal-error-has-been-detected-by-the-Java-Runtime-Environment-SIGSEGV-0xb
https://youtrack.jetbrains.com/issue/KT-54693/SIGSEGV-0xb-at-pc0x0000000000000000-C2-CompilerThread0
https://bugs.openjdk.org/browse/JDK-8303279

I have also seen other reports of this outside of gRPC Kotlin. But without a good way to reproduce this, it is hard to get to the root cause. If someone can come up with that, it'd definitely help the Kotlin & JDK teams to narrow things down.

@Dontcampy

https://bugs.openjdk.org/browse/JDK-8303279 has been fixed now. Does anyone have a good test case to try it with?


4 participants