
Memory leak when request body is anything but JSON #2665

Closed
carlosbarragan opened this issue Jan 20, 2020 · 26 comments

@carlosbarragan

carlosbarragan commented Jan 20, 2020

Steps to reproduce

The bug cannot be easily reproduced locally. However, I have data that supports the problem at hand.

Some information about our application:

Our Micronaut application runs in production on a K8s cluster and receives around 5K requests per minute. We use Kotlin and Reactor in our application.

We have been observing that memory consumption increases over time, and after around 6 hours the Pods have to be restarted by K8s because they run out of memory.
If the load increases or peaks, the restarts happen sooner than 6 hours.

The application receives requests in XML, calls other services (using JSON), and makes some database (MongoDB) calls. Nothing extraordinary. In order to handle the XML, we inject the body into the controller as a String as follows:

fun createSession(@Body rawBody: String, httpHeader: HttpHeaders): Mono<out HttpResponse<String>> 

and afterwards we parse the request using the XmlMapper from Jackson.

Since the clients making requests to the application are programmed to always expect an HTTP 200, the application always returns 200 even if there are errors. That is another reason why we parse the payload inside the controller (the first reason being Micronaut's lack of support for parsing XML directly): it lets us handle exceptions that may occur while parsing the request.
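
For illustration only, here is a minimal sketch of that pattern. The route, controller name, and XML element names are assumptions for the sketch, not taken from the real application:

import com.fasterxml.jackson.dataformat.xml.XmlMapper
import io.micronaut.http.HttpHeaders
import io.micronaut.http.HttpResponse
import io.micronaut.http.MediaType
import io.micronaut.http.annotation.Body
import io.micronaut.http.annotation.Consumes
import io.micronaut.http.annotation.Controller
import io.micronaut.http.annotation.Post
import reactor.core.publisher.Mono

@Controller("/api/sessions") // route is assumed for the sketch
class SessionController {

    // Jackson's XML mapper, used manually inside the controller
    private val xmlMapper = XmlMapper()

    @Post
    @Consumes(MediaType.APPLICATION_XML)
    fun createSession(@Body rawBody: String, httpHeaders: HttpHeaders): Mono<out HttpResponse<String>> {
        return try {
            // Parse the raw XML ourselves so parsing failures can be handled here
            val xml = xmlMapper.readTree(rawBody)
            Mono.just(HttpResponse.ok("<result>ok ${xml.path("device-id").asText()}</result>"))
        } catch (e: Exception) {
            // The clients always expect HTTP 200, so errors are reported in the body instead
            Mono.just(HttpResponse.ok("<result>error ${e.message}</result>"))
        }
    }
}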

A link to a stripped-down version of the application can be found at the bottom of this issue. Please note that we removed a lot of code that was not necessary to describe our problem.

Analysis

We did a memory dump analysis (see screenshot) and observed that objects of type io.netty.buffer.PoolChunk were consuming a lot of memory and, somehow, were not being released. These objects were actually holding the request body, as seen in the screenshot (the actual request is not shown in full due to a privacy agreement with our customer). The screenshot was taken around 30 minutes after deployment. We observed that these objects accumulated over the lifespan of the application and were therefore the cause of the high memory consumption that led to the Pods being restarted by K8s every 6 hours.

Objects in heap [screenshot: ObjectsInHeap]

Content of the PoolChunk object [screenshot: PayloadMemLeak]

We also observed several log statements as follows:
LEAK: ByteBuf.release() was not called before it's garbage-collected.

We tried our best to reproduce the problem locally. To do that, we needed to create load similar to production; that was not the difficult part. The problem was that we did not want to stress-load the other systems called by the application for a long time, so it was not possible for us to reproduce the bug locally.

At this point in time our assumption was that most of the request bodies were not being collected by GC but we didn't know why.

However, while doing a debug session with a lot of breakpoints all over the place, the Netty leak detector was suddenly called and we could determine the exact location of the leak (see screenshot). We were testing with several requests and noticed that one request failed because one of the external services responded with an error. We then executed the exact same request again, and the leak detector was called. Could it be that the error from the external service caused Netty not to clean up the request?

Netty leak detector [screenshot: NettyLeakDetector]

The location of the reported leak was the ByteBufConverters. Apparently, the converter transforms a ByteBuffer into a String but never releases the ByteBuffer. After further investigation, we found out that when Micronaut processes a JSON payload, it does release the payload. In the class JsonContentProcessor, method onData(), you can find the following code:

protected void onData(ByteBufHolder message) {
    ByteBuf content = message.content();

    try {
        byte[] bytes = ByteBufUtil.getBytes(content);
        this.jacksonProcessor.onNext(bytes);
    } finally {
        ReferenceCountUtil.release(content);
    }
}

So it looks like Micronaut reads the payload from the ByteBuffer, parses it, and releases the ByteBuffer afterwards. That does not happen with any other kind of payload, or at least we could not find where it does.

That piece of code gave us the idea for our workaround.

Workaround

Instead of injecting the body as a String, we inject it as a CompositeByteBuf. By doing this, Micronaut does not do any kind of conversion, so we get the payload as-is and read its contents in a similar way to the JsonContentProcessor, meaning we read the ByteBuffer and release it ourselves afterwards:

fun createSession(@Body bodyBytes: CompositeByteBuf, httpHeader: HttpHeaders): Mono<out HttpResponse<String>> {
    val rawBody: String = bodyBytes.releaseAndGetString()
    // ... parse rawBody and handle the request as before
}

The releaseAndGetString() extension function is implemented as follows:

fun ByteBuf.releaseAndGetString(): String {
    return try {
        val byteArray = ByteBufUtil.getBytes(this)
        String(byteArray)
    } finally {
        ReferenceCountUtil.release(this)
    }
}

We applied this workaround and deployed it to production, and since then everything has been running smoothly. No Pod restarts in the last 5 days, and memory consumption remains stable at around 200 MB of heap. We think this is a strong indicator that the payload was not being properly released, either by Micronaut or by Netty.

Expected Behaviour

No memory leaks should happen when injecting a body that is not JSON. If the developer is expected to take care of releasing the ByteBuffer, it should be documented that way.

Environment Information

  • Operating System: Running in Docker based on image openjdk:11-jdk-slim-stretch
  • Micronaut Version: 1.2.6
  • JDK Version: 11

Example Application

https://github.com/carlosbarragan/micronaut-memleak-example

@jameskleeh
Contributor

Also I don't think this is an issue in 1.3 because of the changes there to binding to raw bodies

@carlosbarragan carlosbarragan changed the title from "Memory leak when request body is anyhting but JSON" to "Memory leak when request body is anything but JSON" Jan 21, 2020
@carlosbarragan
Author

> The body should be cleaned up here https://github.com/micronaut-projects/micronaut-core/blob/1.2.x/http-server-netty/src/main/java/io/micronaut/http/server/netty/RoutingInBoundHandler.java#L198

@jameskleeh I debugged our application by setting a breakpoint there but the code was never called.

Is there already a snapshot / milestone release for 1.3? I would like to test our application with 1.3.

@jameskleeh
Contributor

@carlosbarragan Yes there are 2 milestones and snapshots for 1.3

@jameskleeh
Contributor

Closing this due to a lack of feedback. If you can reproduce this issue with 1.3, please let me know and I'll reopen.

@carlosbarragan
Author

We'll try to test with 1.3 as soon as possible.

BTW, we noticed we cannot pass the body as @Body rawBody: CompositeByteBuf in Micronaut 1.2.7 and later. We get an UnsatisfiedRouteResolutionException. Apparently, the converters cannot handle that. It worked with 1.2.6

@carlosbarragan
Author

We tested with 1.3 and, sadly, the memory leak is still there.

Here are the stats:

With 1.2.6 and passing the body as CompositeByteBuf as described in the workaround and after running for 20 hours:

  • Heap usage around 180 MB
  • Total Memory requested by the pod around 600 MB

With 1.3.0 and passing the body as String (passing it as CompositeByteBuf does not work) after running for 90 minutes:

  • Heap usage around 500 MB
  • Total memory requested by the pod around 900 MB.

@jameskleeh please reopen as the problem still persists.

@graemerocher graemerocher reopened this Feb 12, 2020
@graemerocher graemerocher added the status: awaiting validation Waiting to be validated as a real issue label Feb 12, 2020
@graemerocher graemerocher added this to the 1.3.1 milestone Feb 12, 2020
@carlosbarragan
Author

I find it strange that JsonContentProcessor.onData releases the ByteBuf after passing the payload to the jacksonProcessor, but nothing similar happens for other kinds of payloads. Maybe it is done somewhere else, but I just wanted to mention it.

@graemerocher
Contributor

graemerocher commented Feb 12, 2020

So far I took the example application and enabled paranoid level resource leak detection:

@JvmStatic
fun main(args: Array<String>) {
    ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID)
    // ... then start the application as usual
}

Then created a postdata.txt file as follows:

<Session-Request><device-id>test1</device-id><device-context>test2</device-context></Session-Request>

Then ran ab and put the application under load:

ab -p postdata.txt -T application/xml -c 20 -n 10000 http://localhost:8080/api/sessions

I observed in the profiler that there were indeed some ByteBuf object records hanging around and memory consumption went up to 800 MB, but they were placed in "Objects unreachable from GC roots but not yet collected" in YourKit.

Running the app with a memory cap (java -Xmx64m -jar build/libs/mnleak-0.1-all.jar) kept the memory consumption at around 250 MB and triggered the GC to collect these objects, which were in fact collected and did not appear in YourKit.

@graemerocher
Contributor

Within a debugger the bytebuf that produced the string was released here:

I observed no leak messages from Netty's ResourceLeakDetector whatsoever even when the application was placed under load.

Other than GC pauses the application was never unavailable at any point.

@graemerocher
Contributor

In the example app, I monitored it via VisualVM, and after a GC cycle no instances of io.netty.buffer.* classes are retained.

@carlosbarragan
Author

We also tried to reproduce the bug locally using the real application and putting it under load, but we could not successfully reproduce it.

We observed on production that the application starts consuming memory under load after some minutes ( > 30 mins). We have not tried to run it locally that long, but maybe this could help.

We have also observed that memory increases rapidly when the application is under far too much load (10,000 req/min per Pod). Under normal load, it takes much longer for memory consumption to increase.

@carlosbarragan
Author

> Within a debugger the bytebuf that produced the string was released here:
>
> I observed no leak messages from Netty's ResourceLeakDetector whatsoever even when the application was placed under load.
>
> Other than GC pauses the application was never unavailable at any point.

@graemerocher I debugged a single request on our actual application and the release() method in NettyHttpRequest was never called.

@graemerocher
Contributor

@carlosbarragan Interesting. What we need is an example that reproduces that

@carlosbarragan
Author

@graemerocher how can I provide an example without posting a real request, given the privacy and NDA agreements with our customer? I am open to doing a video session with you or one of your team members if you believe this could help. Perhaps we are missing some configuration or something like that.

@graemerocher
Contributor

I'm investigating some other approaches that move the logic that calls release() elsewhere. Once I have something, I will ask you to try a snapshot.

graemerocher added a commit that referenced this issue Feb 12, 2020
@carlosbarragan
Author

Ok.

FYI, the application also has a JSON API. I debugged a request to this API and the release() method wasn't called either.

@graemerocher
Contributor

@carlosbarragan Can you try the 1.3.1.BUILD-SNAPSHOT version and report back? See https://docs.micronaut.io/latest/guide/index.html#usingsnapshots

@carlosbarragan
Author

@graemerocher it turns out the release() method was not being called because our REST client was keeping the connection alive. As soon as we closed the REST client, the channelInactive() method was called, as well as the release() method.

Nevertheless, we noticed that only the last request body we sent was being released. All other requests we sent before closing the client were not cleaned up.

I don't know the internal workings of Netty, but could it be that it reuses the same channel under load and therefore the channel never becomes inactive? This could lead to the situation where the requests are not properly released.
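
For what it's worth, a rough way to exercise this situation locally is to send many requests over a single persistent connection. Below is a minimal sketch using JDK 11's java.net.http.HttpClient, which reuses HTTP/1.1 keep-alive connections by default; the endpoint and payload are assumptions based on the example earlier in this thread:

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    // One HttpClient instance reuses its HTTP/1.1 connections (keep-alive),
    // so the server-side channel stays active for the whole run.
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8080/api/sessions")) // assumed local endpoint
        .header("Content-Type", "application/xml")
        .POST(HttpRequest.BodyPublishers.ofString(
            "<Session-Request><device-id>test1</device-id></Session-Request>"))
        .build()

    repeat(10_000) {
        // Every request travels over the same persistent connection; if request
        // bodies are only released on channelInactive, they accumulate on the server.
        client.send(request, HttpResponse.BodyHandlers.ofString())
    }
}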

@graemerocher
Contributor

@carlosbarragan does the snapshot resolve the issue?

@carlosbarragan
Author

Haven’t tried yet. I’ll try it tomorrow morning

@graemerocher
Contributor

@carlosbarragan I have pushed another potential fix, since I am not 100% sure my first fix will solve the problem based on the information you supplied that it relates to persistent connections.

If you can try the snapshot tomorrow that would be great. Thanks for collaborating on this.

@carlosbarragan
Author

@graemerocher I'm almost certain this problem is related to keep-alive connections. I quickly verified and can confirm that the keep-alive header is present in our prod environment.

@carlosbarragan
Author

The fix looks promising. I tested it locally with a persistent connection and the request was correctly cleaned up. I will deploy it and give feedback as soon as possible.

@carlosbarragan
Author

@graemerocher we deployed the fix and let it run for about an hour, and it seems to work!

Some stats of the application after running for 60 minutes:

  • Heap consumption is about 140 MB! (Better than with the workaround)
  • Pod memory consumption about 500 MB.

Thank you very much for your quick support!

@graemerocher
Contributor

Great news, thanks for the feedback
