Memory leak #857

Open
bsandmann opened this issue Jan 23, 2024 · 12 comments

@bsandmann

Is this a regression?

Yes

Description

It seems like the prism-agents (tested on 1.19.1) have a memory leak. I’m running a single agent on Ubuntu 22.04 using the “docker-compose.yml” setup as described in the README of the GitHub page. The memory usage of that single agent was slowly increasing by around 1 GB a day – and that’s a single agent at idle, with no user interaction at all. I haven’t investigated whether it is the agent, the node, or some other component, but it’s something someone should look into. See screenshot. I’m currently also testing 1.24.0, but it looks like it’s leaking at the same rate.

[screenshot: memory usage of the agent host increasing over several days]

Please provide the exception or error you saw

No response

Please provide the environment you discovered this bug in

No response

Anything else?

No response

@ghost

ghost commented Jan 23, 2024

Thanks @bsandmann - our agents running in Kubernetes haven't seen similar behaviour. That isn't to say there isn't a leak; it might just be that the resource constraints and garbage collection are hiding it. Checking the docker-compose.yml, we don't set any restrictions on memory usage. I'll run this test on an Ubuntu host now and report back in a few days to confirm the issue.

@ghost ghost self-assigned this Jan 23, 2024
@ghost

ghost commented Feb 6, 2024

Update: I've run agents using the docker compose file included in the repo for several days at a time. I can observe a build-up of memory usage, but garbage collection kicks in and reduces it; the GC saw-tooth pattern is present, so all looks good. I don't want to close this just yet until I've done a specific test to concretely prove there is no leak, so I'll be soak testing some agents towards the end of this week and will post the results on this ticket.

I'll leave the agents idle for 24 hours, soak test for 24 hours (if possible, it may need to be shorter) and then leave them idle again for 24 hours.

@ghost

ghost commented Feb 20, 2024

Update: I've had issues running these tests (due to local hardware, nothing to do with the Agent) and need to repeat them.

@bsandmann
Author

Here is the development over a month (continuation of the earlier image/test)
[screenshot: memory usage over one month, continuing the earlier graph]

@FabioPinheiro
Contributor

tl;dr

To investigate a memory leak, I care about the part where this graph is stable, i.e. the part where the memory used by the JVM doesn't increase.

After that, I want to know:

  • How frequently does the garbage collection run? Does that frequency increase over time?
  • When the garbage collector does a full collection (not a quick one), does memory usage always come back to the same baseline, or is the baseline increasing?

@davidpoltorak-io So my suggestion would be to limit the memory of the JVM to 4, 5 or 6 GB (the minimum for the system to be stable and for the garbage collector to start working) and to print GC stats/details.

I'm not sure which flag we need to pass when starting the JVM to print this information. Maybe -Xlog:gc, if it's not too verbose. (Also, let's configure this GC log to write to a file instead of the console/standard output.)
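A minimal sketch of how this could be wired into the docker-compose.yml, assuming the agent image honours the standard JAVA_TOOL_OPTIONS environment variable picked up by HotSpot JVMs (the service name and log path here are hypothetical, not taken from the repo):

```yaml
services:
  cloud-agent:                       # hypothetical service name
    environment:
      # Cap the heap and write unified GC logs (JDK 9+) to a rotating file
      # instead of standard output.
      JAVA_TOOL_OPTIONS: >-
        -Xmx4g
        -Xlog:gc*:file=/var/log/agent/gc.log:time,uptime,level,tags:filecount=5,filesize=20m
    volumes:
      - ./gc-logs:/var/log/agent     # persist the GC log outside the container
```

With something like this in place, the baseline after each full collection can be read directly from the GC log rather than inferred from host-level graphs.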

@mkbreuningIOHK
Contributor

@davidpoltorak-io, do you have a contributor account to reassign this issue to? cc @yshyn-iohk @mineme0110

@yshyn-iohk
Member

@bsandmann, here are some screenshots from our Grafana dashboard.
One of the agents in our SIT environment (last three days):
[screenshot: Grafana memory usage for one SIT agent, last three days]

And a similar screenshot for the same agent over the last seven days:
[screenshot: Grafana memory usage for the same agent, last seven days]

These pictures don't look like a memory leak issue.

Could you share how you run the agent and any other essential details for reproducing it?

@bsandmann
Author

bsandmann commented Mar 26, 2024

@yshyn-iohk Thanks for taking a look at it.

I'm following the Quick-Start instructions without any modifications on an Ubuntu 22.04 installation. I haven't looked deeper into the issue yet, but I've noticed this behavior on similar setups with one or multiple agents. Some ideas:

  • It might not be the agent itself causing the problem but rather some other component that is started alongside it.
  • The issue could also be related to the virtualization setup (Proxmox) being somewhat incompatible with something inside the Docker containers. However, I'm running similar VMs with other images just fine.
  • I'm also running out of disk space on those VMs with the agents. The agent has used up a few GB over the last few weeks, just idling, and now it's starving the VM of all its disk space. There's nothing else but the agent running on that VM, and it must have filled up 32 GB in a few weeks; I only just noticed this and haven't had the time to look into it yet.

Any ideas?

@robertocarvajal
Contributor

Please run docker container stats to check how much RAM the agent containers are actually using. I don't think they should be anywhere near that.

What I think may be happening is Docker itself basically eating all the available RAM on the VM. You can limit how much RAM Docker can use, of course, but by default it will be happy to use all that's available and reserve it for the containers.

https://docs.docker.com/config/containers/resource_constraints/

I realized I have the same problem on my test deploy :)
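For reference, a sketch of how such a per-service cap could be expressed in the docker-compose.yml with a recent docker compose (the service name and the 2 GB value are only illustrative, not the project's defaults):

```yaml
services:
  cloud-agent:            # hypothetical service name
    deploy:
      resources:
        limits:
          memory: 2g      # hard cap; the container is OOM-killed if it exceeds this
        reservations:
          memory: 1g      # soft reservation, reclaimed first under host memory pressure
```

A cap like this also makes a real leak much easier to spot, because the container gets killed instead of the host slowly filling up.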

@bsandmann
Author

I now have a VM that has been running for 12 days. On three occasions I used the commands htop, free, docker container stats, and df to get a slightly better picture of what is going on. This is obviously far from an in-depth analysis, but it gives a rough picture.
[attachment: Identus_ressource_usage – table of htop/free/docker container stats/df snapshots]
I haven't looked at it in detail, but there are some first conclusions:

  1. The problem isn't as big as initially thought, because Proxmox can't look inside the Linux VM to see the actual memory allocation, which can cause it to display a higher memory footprint than is actually used. Nonetheless, this doesn't happen for all Linux VM workloads, so it might still point to some unusual behavior. Which brings us to:
  2. There is still a small memory increase noticeable over the 12 days, as well as a disk space increase. I'll let it run further to see whether the memory increase persists or whether it is just a zigzag pattern that isn't detectable with only three snapshots.

If you have any comments on which other data to collect from those VMs, let me know. I'm planning to let the VM run for another 1-2 weeks, and maybe we'll get something out of it.

@yshyn-iohk
Member

Thanks, @bsandmann, for the additional information!
I will also check the memory footprint of the Cloud Agent v1.33.0 in our SIT environment and update this ticket later this week.

@bsandmann
Author

bsandmann commented May 28, 2024

@yshyn-iohk Here is an additional data capture from the VM:
[attachment: Identus_ressource_usage2 – updated table of resource snapshots]
A few things are worth noting:

  1. The GC seems to keep the increasing memory footprint in check, so technically this can't really be classified as a memory leak, in my opinion. Nonetheless, the memory allocation pattern is a bit unusual and likely more than necessary for an application at idle. Without having looked at the code, I would say it is due to sub-optimal handling of logs or strings.
  2. Between the 20th and the 28th, the memory footprint of all applications stayed roughly identical. Did we reach a peak, or is this just a lucky snapshot?
  3. The disk usage increased further: if my calculation is correct, each instance (I'm running 2 agents) is writing about 45 MB of data to disk daily. That's nearly 1.5 GB a month and around 17 GB a year ... at idle. I guess the memory allocation issue is closely related to this data creation - probably logs (a possible mitigation is sketched below).
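If the disk growth really is container logs, one option would be to cap Docker's json-file log driver per service in the docker-compose.yml. A sketch, assuming the default json-file driver is in use (the service name and sizes are only examples):

```yaml
services:
  cloud-agent:             # hypothetical service name
    logging:
      driver: json-file
      options:
        max-size: "10m"    # rotate each container log file at 10 MB
        max-file: "5"      # keep at most 5 rotated files (~50 MB per container)
```

That caps log growth on the host without touching the agent's own logging configuration.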
