Memory leak #857

Open
bsandmann opened this issue Jan 23, 2024 · 12 comments

@bsandmann

Is this a regression?

Yes

Description

It seems like the prism-agents (tested on 1.19.1) have a memory leak. I’m running a single agent on Ubuntu 22.04 using the “docker-compose.yml” setup as described in the README of the GitHub page. The memory usage of that single agent was slowly increasing by around 1 GB a day – and that’s a single agent at idle, with no user interaction at all. I haven’t investigated whether it is the agent, the node, or some other component, but it’s something someone should look into. See screenshot. I’m currently also testing 1.24.0, but it looks like it’s leaking at the same rate.

[screenshot: memory usage of the agent host increasing over several days]

Please provide the exception or error you saw

No response

Please provide the environment you discovered this bug in

No response

Anything else?

No response

@ghost

ghost commented Jan 23, 2024

Thanks @bsandmann - our agents running in Kubernetes haven't seen similar behaviour. That isn't to say there isn't a leak; it might just be that the resource constraints and garbage collection are hiding it. Checking the docker-compose.yml, we don't set any restrictions on memory usage. I'll run this test on an Ubuntu host now and report back in a few days to confirm the issue.

@ghost ghost self-assigned this Jan 23, 2024
@ghost

ghost commented Feb 6, 2024

Update: I've run agents using the docker compose file included in the repo for several days at a time. I can observe a build-up of memory usage, but garbage collection kicks in and reduces it; the GC saw-tooth pattern is present, so all looks good. I don't want to close this just yet until I've done a specific test to concretely prove there is no leak, so I'll be soak testing some agents towards the end of this week and will post the results on this ticket.

I'll leave the agents idle for 24 hours, soak test for 24 hours (if possible, it may need to be shorter) and then leave them idle again for 24 hours.

@ghost

ghost commented Feb 20, 2024

Update: I've had issues running these tests (due to local hardware, nothing to do with the Agent) and need to repeat them.

@bsandmann
Author

Here is the development over a month (continuation of the earlier image/test)
[screenshot: memory usage over one month, continuing the earlier graph]

@FabioPinheiro
Contributor

tl;dr

To investigate a memory leak, I care about the part where this graph is stable, i.e. the part where the memory used by the JVM doesn't increase.

After that, I want to know:

  • How frequently does the garbage collection run? Does that frequency increase over time?
  • When the garbage collector does a full collection (not a quick one), does memory usage always come back to the same baseline, or is the baseline increasing?

@davidpoltorak-io So my suggestion would be to limit the memory of the JVM to 4, 5 or 6 GB (the minimum for the system to be stable and for the garbage collector to start working) and to print GC stats/details.

I'm not sure which flag we need to pass when starting the JVM to print this information. Maybe -Xlog:gc, if it's not too verbose. (Also, let's configure this GC log to write to a file instead of the console/standard output.)
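A minimal sketch of how this could be wired into the docker-compose.yml, assuming the agent image honours the standard JAVA_TOOL_OPTIONS environment variable picked up by HotSpot JVMs (the service name and log path here are hypothetical, not taken from the repo):

```yaml
services:
  cloud-agent:                       # hypothetical service name
    environment:
      # Cap the heap and write unified GC logs (JDK 9+) to a rotating file
      # instead of standard output.
      JAVA_TOOL_OPTIONS: >-
        -Xmx4g
        -Xlog:gc*:file=/var/log/agent/gc.log:time,uptime,level,tags:filecount=5,filesize=20m
    volumes:
      - ./gc-logs:/var/log/agent     # persist the GC log outside the container
```

With something like this in place, the baseline after each full collection can be read directly from the GC log rather than inferred from host-level graphs.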

@mkbreuningIOHK
Contributor

@davidpoltorak-io, do you have a contributor account to reassign this issue to? cc @yshyn-iohk @mineme0110

@yshyn-iohk
Member

@bsandmann, here are some screenshots from our Grafana dashboard.
One of the agents in our SIT environment (last three days):
[screenshot: Grafana memory usage for one SIT agent, last three days]

And a similar screenshot for the same agent over the last seven days:
[screenshot: Grafana memory usage for the same agent, last seven days]

These pictures don't look like a memory leak issue.

Could you share how you run the agent and any other essential details for reproducing it?

@bsandmann
Author

bsandmann commented Mar 26, 2024

@yshyn-iohk Thanks for taking a look at it.

I'm following the Quick-Start instructions without any modifications on an Ubuntu 22.04 installation. I haven't looked deeper into the issue yet, but I've noticed this behavior on similar setups with one or multiple agents. Some ideas:

  • It might not be the agent itself causing the problem but rather some other component that is started alongside it.
  • The issue could also be related to the virtualization setup (Proxmox) being somewhat incompatible with something inside the Docker containers. However, I'm running similar VMs with other images just fine.
  • I'm also running out of disk space on those VMs with the agents. The agent has used up a few GB over the last few weeks, just idling, and now it's starving the VM of all its disk space. There's nothing else but the agent running on that VM, and it must have filled up 32 GB in a few weeks; I only just noticed this and haven't had the time to look into it yet.

Any ideas?

@robertocarvajal
Contributor

Please run docker container stats to check how much RAM the agent containers are actually using. I don't think they should be anywhere near that.

What I think may be happening is Docker itself basically eating all the available RAM on the VM. You can limit how much RAM Docker can use, of course, but by default it will be happy to use all that's available and reserve it for the containers.

https://docs.docker.com/config/containers/resource_constraints/

I realized I have the same problem on my test deploy :)
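For reference, a sketch of how such a per-service cap could be expressed in the docker-compose.yml with a recent docker compose (the service name and the 2 GB value are only illustrative, not the project's defaults):

```yaml
services:
  cloud-agent:            # hypothetical service name
    deploy:
      resources:
        limits:
          memory: 2g      # hard cap; the container is OOM-killed if it exceeds this
        reservations:
          memory: 1g      # soft reservation, reclaimed first under host memory pressure
```

A cap like this also makes a real leak much easier to spot, because the container gets killed instead of the host slowly filling up.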

@bsandmann
Author

I now have a VM that has been running for 12 days. On three occasions I used the commands htop, free, docker container stats, and df to get a slightly better picture of what is going on. This is obviously far from an in-depth analysis, but it gives a rough picture.
[attachment: Identus_ressource_usage – table of htop/free/docker container stats/df snapshots]
I haven't looked at it in detail, but there are some first conclusions:

  1. The problem isn't as big as initially thought, because Proxmox can't look inside the Linux VM to see the actual memory allocation, which can cause it to display a higher memory footprint than is actually used. Nonetheless, this doesn't happen for all Linux VM workloads, so it might still point to some unusual behavior. Which brings us to:
  2. There is still a small memory increase noticeable over the 12 days, as well as a disk space increase. I'll let it run further to see whether the memory increase persists or whether it is just a zigzag pattern that isn't detectable with only three snapshots.

If you have any comments on which other data to collect from those VMs, let me know. I'm planning to let the VM run for another 1-2 weeks, and maybe we'll get something out of it.

@yshyn-iohk
Member

Thanks, @bsandmann, for the additional information!
I will also check the memory footprint of the Cloud Agent v1.33.0 in our SIT environment and update this ticket later this week.

@bsandmann
Author

bsandmann commented May 28, 2024

@yshyn-iohk Here is an additional data capture from the VM:
[attachment: Identus_ressource_usage2 – updated table of resource snapshots]
A few things are worth noting:

  1. The GC seems to keep the increasing memory footprint in check, so technically this can't really be classified as a memory leak, in my opinion. Nonetheless, the memory allocation pattern is a bit unusual and likely more than necessary for an application at idle. Without having looked at the code, I would say it is due to sub-optimal handling of logs or strings.
  2. Between the 20th and the 28th, the memory footprint of all applications stayed roughly identical. Did we reach a peak, or is this just a lucky snapshot?
  3. The disk usage increased further: if my calculation is correct, each instance (I'm running 2 agents) is writing about 45 MB of data to disk daily. That's nearly 1.5 GB a month and around 17 GB a year ... at idle. I guess the memory allocation issue is closely related to this data creation - probably logs (a possible mitigation is sketched below).
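If the disk growth really is container logs, one option would be to cap Docker's json-file log driver per service in the docker-compose.yml. A sketch, assuming the default json-file driver is in use (the service name and sizes are only examples):

```yaml
services:
  cloud-agent:             # hypothetical service name
    logging:
      driver: json-file
      options:
        max-size: "10m"    # rotate each container log file at 10 MB
        max-file: "5"      # keep at most 5 rotated files (~50 MB per container)
```

That caps log growth on the host without touching the agent's own logging configuration.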
