
Memory leak Chrome #642

Closed
petternordlanderhelo opened this issue May 7, 2023 · 20 comments · Fixed by #705
Labels: bug, chromium

Comments

@petternordlanderhelo

So we are running Gotenberg in Kubernetes to do PDF conversion using Chromium. Nothing fancy like PDF/A, LibreOffice, etc. — simply pushing HTML to "/forms/chromium/convert/url".

However, there seems to be a memory leak of sorts. I'm aware that the Chromium instance gets recycled on every conversion, which makes the memory leak weird. With every conversion, memory usage increases slightly until the container hits Out of Memory, no matter the memory limit. Are there any known memory leak issues with Gotenberg?

[screenshot: memory usage graph]

@gulien added the bug, chromium, and maybe labels on May 9, 2023
@gulien
Collaborator

gulien commented May 9, 2023

Hello @petternordlanderhelo,

That's weird 🤔 I don't see any memory leaks on the demo instance, but that may be more related to its traffic volume.

Anyway, it's either:

  • A memory leak in Chromium - but that's weird; as you said, it's a new instance for each conversion
  • A memory leak in the Gotenberg code directly.

Could you check what's going on by running the top command inside the container?

@JulienWeintraub

JulienWeintraub commented Aug 4, 2023

Hello!

I'll take the opportunity to continue the discussion in this current issue instead of opening a new one ;)

We are also facing memory issues with Gotenberg in our production environment, which is likewise hosted on GKE.

Below is the sizing configuration (hpa = Horizontal Pod Autoscaler):

---
sizing:
  - hpa:
      min_scale: 1
      max_scale: 4
  - resources:
      Limits:
        memory: 1536Mi
      Requests:
        cpu: 1
        memory: 1536Mi

Below is a graph from a Datadog dashboard showing memory, HPA behavior, and all conversion requests over one week (all /health queries are removed from the graph):
[Datadog dashboard screenshot]

We can see the steady memory increase: the drops in the memory graph are only workload redeployments or restarts.

Interesting point: after a couple of hours/days, I can see lingering chrome processes in each pod (via ps/top/htop/...):

Current chrome process count in production (across 4 pods):

= gotenberg-service-7b765b55f8-4vblm =
      34
= gotenberg-service-7b765b55f8-8wlkg =
      24
= gotenberg-service-7b765b55f8-ljnjs =
      35
= gotenberg-service-7b765b55f8-n4sfn =
      34

I recently tweaked our Gotenberg Docker image (based on the project's), adding a custom background script that counts the chrome processes and logs the count so it can be shipped to the Datadog logging service.
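
For illustration, the idea is just to periodically count processes named "chrome" and log the number so the log shipper picks it up. My actual script is written in Python and is not shown here; below is only a minimal Go sketch of the same idea, with made-up names:

package main

import (
    "fmt"
    "os"
    "strings"
    "time"
)

// countChromeProcesses walks /proc and counts processes whose comm name is "chrome".
func countChromeProcesses() (int, error) {
    entries, err := os.ReadDir("/proc")
    if err != nil {
        return 0, err
    }
    count := 0
    for _, e := range entries {
        // PID directories are purely numeric.
        if !e.IsDir() || strings.Trim(e.Name(), "0123456789") != "" {
            continue
        }
        comm, err := os.ReadFile("/proc/" + e.Name() + "/comm")
        if err != nil {
            continue // the process may have exited in the meantime
        }
        if strings.TrimSpace(string(comm)) == "chrome" {
            count++
        }
    }
    return count, nil
}

func main() {
    // Log the count every minute; the log line is then shipped to Datadog.
    for range time.Tick(time.Minute) {
        if n, err := countChromeProcesses(); err == nil {
            fmt.Printf("chrome_process_count=%d\n", n)
        }
    }
}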

For each new instance (pod), I can see the total count increasing.
Below is the evolution over a 2-day period:
[chrome process count graph]
-> Here, we see the count increasing for each one.

ℹ️ Just a warning about the count graph: it uses time-range aggregation.
Zooming in on a shorter period gives the same values as my CLI check above:
[chrome process count graph, zoomed in]

Shouldn't both memory and the Chrome process count decrease when activity goes down?

Thanks a lot for your help; any advice on this issue is welcome :/

@gulien
Collaborator

gulien commented Aug 4, 2023

Hello @JulienWeintraub,

Thank you for your insightful comment!

Under normal circumstances, a Chromium browser—comprising X processes—should initiate and then close down with each conversion. This implies that, absent any active requests, there should be no running processes.

I have a couple of questions to understand this better:

  1. Could these be orphaned processes?
  2. Could they be zombie processes? In other words, could the entrypoint have been overridden, resulting in tini no longer being in use?

@petternordlanderhelo
Author

Nice input @JulienWeintraub!

We also seem to have chrome processes left over. This screenshot is from "top" in a Gotenberg container that has been running for about an hour, with no active conversions/requests at the time of capture and a low request rate overall. It looks like a chrome instance is left alive after at least some of the conversions, though I have no stats on this. There are also a lot of suspicious "empty" "cat" and "chrome" processes (possibly from the tooling) without any command line, etc. This is Kubernetes on AWS.

[top output screenshot]

@JulienWeintraub

@petternordlanderhelo It seems we are facing the same issues, right? :)

@gulien some feedback on your questions:

Could these be orphaned processes?
Could they be zombie processes? In other words, could the entrypoint have been overridden, resulting in tini no longer being in use?

Not at the moment, but we did have some defunct chromium processes back when we had OOM issues (we worked around that by sending conversion requests to Gotenberg sequentially from our caller service).

The top output from when we were getting them:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
gotenbe+ 10280  0.0  0.0   3952  3120 pts/2    Ss+  14:03   0:00 bash
gotenbe+    17  0.4  0.0   3952  2436 pts/1    Ss   14:02   0:01 bash
gotenbe+ 14989  0.0  0.0   6440  2544 pts/1    R+   14:08   0:00  \_ ps faux
gotenbe+     1  0.0  0.0   1944   224 pts/0    Ss   14:01   0:00 /usr/bin/tini -- gotenberg
gotenbe+     7  0.7  0.1 798112 10668 pts/0    Sl+  14:01   0:03 gotenberg
gotenbe+  8517  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  8786  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  8937  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9038  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9058  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9146  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9148  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9151  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9155  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9186  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9187  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9256  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9266  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9267  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9270  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9272  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9289  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9297  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>
gotenbe+  9310  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>

Here, the chrome processes are not zombie processes, and they all have a PPID of 1 (the tini process).

But to be accurate, I did indeed override tini in 2 cases:

  • The first one with only the gotenberg app and extra parameters (through CMD)
  • The second one (the one currently running) with a custom entrypoint script that starts my chrome process counter in the background and then executes gotenberg with extra parameters

-> The result is the same regarding the memory issues.

Below is my ps result; you can see the process hierarchy, with tini as the init process (PID 1):

UID          PID    PPID  C STIME TTY          TIME CMD
gotenbe+  137057       0  0 14:14 pts/0    00:00:00 bash
gotenbe+  137131  137057  0 14:17 pts/0    00:00:00  \_ ps -fe --forest
gotenbe+       1       0  0 Aug02 ?        00:00:07 /usr/bin/tini -- entrypoint.sh gotenberg --api-port=3000 --api-timeout=60s --chromium-failed-starts-threshold=5 --log-format=auto --log-level=info --log-fields-prefix=gotenberg --gotenberg-graceful-shutdown-duration=30s
gotenbe+       7       1  0 Aug02 ?        00:00:00 /bin/bash /usr/bin/entrypoint.sh gotenberg --api-port=3000 --api-timeout=60s --chromium-failed-starts-threshold=5 --log-format=auto --log-level=info --log-fields-prefix=gotenberg --gotenberg-graceful-shutdown-duration=30s
gotenbe+       8       7  0 Aug02 ?        00:00:00  \_ /bin/bash /usr/bin/entrypoint.sh gotenberg --api-port=3000 --api-timeout=60s --chromium-failed-starts-threshold=5 --log-format=auto --log-level=info --log-fields-prefix=gotenberg --gotenberg-graceful-shutdown-duration=30s
gotenbe+      10       8  0 Aug02 ?        00:02:08  |   \_ python3 -u chrome_process_counter.py
gotenbe+       9       7  0 Aug02 ?        00:04:33  \_ gotenberg --api-port=3000 --api-timeout=60s --chromium-failed-starts-threshold=5 --log-format=auto --log-level=info --log-fields-prefix=gotenberg --gotenberg-graceful-shutdown-duration=30s
gotenbe+    5113       1  0 Aug02 ?        00:00:00 cat
gotenbe+    5141       1  0 Aug02 ?        00:00:00 /opt/google/chrome/chrome --type=renderer --headless --crashpad-handler-pid=5116 --first-renderer-process --no-sandbox --disable-dev-shm-usage --disable-background-timer-throttling --disable-breakpad --enable-automation --force-color-profile=srgb --remote-debugging-port=0 --allow-pre-commit-input --ozone-platform=headless --disable-gpu-compositing --lang=en-US --num-raster-threads=1 --renderer-client-id=4 --time-ticks-at-unix-epoch
gotenbe+    7848       1  0 Aug02 ?        00:00:00 cat

@petternordlanderhelo
Author

I have a large number of these as well. They have PPID 1, and it seems a bit strange that they're hanging around.

[process list screenshot]

@gulien
Collaborator

gulien commented Aug 4, 2023

The first one with only the gotenberg app and extra parameters (through CMD)

In that case you did not override it: CMD overrides the "command", not the "entrypoint" 😄

Anyway, it looks like we do have zombies which are not reaped by tini. I have no idea why, because their PPID is indeed 1, i.e., tini.

@gulien
Collaborator

gulien commented Aug 4, 2023

For your information, the demo instance has successfully managed over 2,500 requests in the last 24 hours, predominantly through the /forms/chromium/convert/html route. At present, there are no zombie processes. 🤔

The memory profile (ignore the missing data points):

[memory profile screenshot]

@gulien
Collaborator

gulien commented Aug 4, 2023

(Sorry for the spam)

  • Are you using the latest Gotenberg version (the demo instance was updated 8 days ago, no zombies)?
  • Are you on an amd64 platform?

@JulienWeintraub

JulienWeintraub commented Aug 4, 2023

In that case you did not override it: CMD overrides the "command", not the "entrypoint" 😄

Bad wording on my side, my bad ;)

Are you using the latest Gotenberg version (the demo instance was updated to the latest 8 days ago, no zombies)?

Yes, we're on version 7.9.1 (just tweaked with my custom script):


  _____     __           __
 / ___/__  / /____ ___  / /  ___ _______ _
/ (_ / _ \/ __/ -_) _ \/ _ \/ -_) __/ _ '/
\___/\___/\__/\__/_//_/_.__/\__/_/  \_, /
                                   /___/

A Docker-powered stateless API for PDF files.
Version: 7.9.1
-------------------------------------------------------

Are you on an amd64 platform?

Yes :

kubernetes.io/arch: amd64
kubernetes.io/os: linux

On my side, 3.36k requests over 24h, but I don't think that makes a big difference.

But as I said, there are no zombie processes on my side either (the defunct ones were from before the client-side workaround was implemented).

Just lingering chrome processes accumulating over time :/

@gulien
Collaborator

gulien commented Aug 4, 2023

gotenbe+  9146  0.0  0.0      0     0 pts/0    Z+   14:03   0:00  \_ [chromium] <defunct>

The Z+ indicates this is a zombie 🤔

Z    defunct ("zombie") process, terminated but not reaped by its parent.
+    is in the foreground process group.

What am I missing?

@JulienWeintraub

@gulien Nothing, but that table was BEFORE our client-side workaround.

Now we have chrome processes, but not zombie ones!

I just redeployed the production service, though with lower activity:


    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 gotenbe+  20   0    2384    576    504 S   0.3   0.0   0:00.38 tini
      7 gotenbe+  20   0    4036   2892   2604 S   0.0   0.0   0:00.00 entrypoint.sh
      8 gotenbe+  20   0    4036   2128   1784 S   0.0   0.0   0:00.00 entrypoint.sh
      9 gotenbe+  20   0  731244  26072  16652 S   0.0   0.3   0:24.03 gotenberg
     10 gotenbe+  20   0   15652  11956   6212 S   0.0   0.1   0:02.90 python3
   1569 gotenbe+  20   0    2544    580    504 S   0.0   0.0   0:00.00 cat
   1570 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
   1572 gotenbe+  20   0   32.0g   3032   2740 S   0.0   0.0   0:00.00 chrome_crashpad
   1596 gotenbe+  20   0 1130.8g 133356  97124 S   0.0   1.6   0:00.29 chrome
   2815 gotenbe+  20   0    2544    584    504 S   0.0   0.0   0:00.00 cat
   2816 gotenbe+  20   0    2544    580    504 S   0.0   0.0   0:00.00 cat
   2818 gotenbe+  20   0   32.0g   2980   2684 S   0.0   0.0   0:00.00 chrome_crashpad
   2843 gotenbe+  20   0 1130.8g 132704  96628 S   0.0   1.6   0:00.30 chrome
   8165 gotenbe+  20   0    2544    580    504 S   0.0   0.0   0:00.00 cat
   8166 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
   8168 gotenbe+  20   0   32.0g   3064   2772 S   0.0   0.0   0:00.00 chrome_crashpad
   8193 gotenbe+  20   0 1130.8g 139156  97412 S   0.0   1.7   0:00.37 chrome
  11081 gotenbe+  20   0    2544    580    504 S   0.0   0.0   0:00.00 cat
  11082 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  11084 gotenbe+  20   0   32.0g   3076   2776 S   0.0   0.0   0:00.00 chrome_crashpad
  11109 gotenbe+  20   0 1130.8g 135268  95268 S   0.0   1.7   0:00.34 chrome
  14656 gotenbe+  20   0    2544    520    440 S   0.0   0.0   0:00.00 cat
  14659 gotenbe+  20   0   32.0g   3436   3140 S   0.0   0.0   0:00.00 chrome_crashpad
  14685 gotenbe+  20   0 1130.8g 137284  95840 S   0.0   1.7   0:00.35 chrome
  17095 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  17096 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  17098 gotenbe+  20   0   32.0g   3132   2836 S   0.0   0.0   0:00.00 chrome_crashpad
  17123 gotenbe+  20   0 1130.8g 136980  96728 S   0.0   1.7   0:00.33 chrome
  18105 gotenbe+  20   0    2544    584    504 S   0.0   0.0   0:00.00 cat
  18108 gotenbe+  20   0   32.0g   3088   2800 S   0.0   0.0   0:00.00 chrome_crashpad
  18133 gotenbe+  20   0 1130.8g 136284  94572 S   0.0   1.7   0:00.40 chrome
  21125 gotenbe+  20   0    2544    520    440 S   0.0   0.0   0:00.00 cat
  21126 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  21128 gotenbe+  20   0   32.0g   3016   2720 S   0.0   0.0   0:00.00 chrome_crashpad
  21154 gotenbe+  20   0 1130.8g 118696  88316 S   0.0   1.5   0:00.18 chrome
  22857 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  22858 gotenbe+  20   0    2544    580    504 S   0.0   0.0   0:00.00 cat
  22860 gotenbe+  20   0   32.0g   3076   2776 S   0.0   0.0   0:00.00 chrome_crashpad
  22885 gotenbe+  20   0 1130.8g 140716  99928 S   0.0   1.7   0:00.33 chrome
  23274 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  23275 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  23277 gotenbe+  20   0   32.0g   3028   2728 S   0.0   0.0   0:00.00 chrome_crashpad
  23304 gotenbe+  20   0 1130.8g 139384  99344 S   0.0   1.7   0:00.34 chrome
  23984 gotenbe+  20   0    2544    520    440 S   0.0   0.0   0:00.00 cat
  23985 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  23987 gotenbe+  20   0   32.0g   2892   2584 S   0.0   0.0   0:00.00 chrome_crashpad
  24011 gotenbe+  20   0 1130.8g 136004  96680 S   0.0   1.7   0:00.32 chrome
  26847 gotenbe+  20   0    2544    584    504 S   0.0   0.0   0:00.00 cat
  26848 gotenbe+  20   0    2544    516    440 S   0.0   0.0   0:00.00 cat
  26850 gotenbe+  20   0   32.0g   3092   2804 S   0.0   0.0   0:00.00 chrome_crashpad
  26875 gotenbe+  20   0 1130.8g 136104  97036 S   0.0   1.7   0:00.31 chrome
  27913 gotenbe+  20   0    2544    580    504 S   0.0   0.0   0:00.00 cat
  27914 gotenbe+  20   0    2544    520    440 S   0.0   0.0   0:00.00 cat
  27916 gotenbe+  20   0   32.0g   3124   2824 S   0.0   0.0   0:00.00 chrome_crashpad
  27940 gotenbe+  20   0 1130.8g 136352  96040 S   0.0   1.7   0:00.36 chrome
  31369 gotenbe+  20   0    4168   3324   2904 S   0.0   0.0   0:00.00 bash

-> No defunct/zombie/orphan processes

@gulien
Collaborator

gulien commented Aug 4, 2023

Thanks 😄

I wonder if it might be related to: https://medium.com/@felixge/killing-a-child-process-and-all-of-its-children-in-go-54079af94773. Fun fact: I do have a working implementation for all processes but Chromium, as the execution is handled by the chromedp library. See also chromedp/chromedp#752.
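
For reference, the gist of that article is to start the child process in its own process group and to signal the whole group on timeout, so that grandchildren cannot escape the kill. A rough, Linux-only sketch of that pattern (not Gotenberg's actual code):

package main

import (
    "os/exec"
    "syscall"
    "time"
)

// runWithGroupKill starts a command in its own process group and, on timeout,
// kills the whole group so no descendant process can survive.
func runWithGroupKill(timeout time.Duration, name string, args ...string) error {
    cmd := exec.Command(name, args...)
    // Setpgid puts the child (and its future children) in a dedicated process group.
    cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

    if err := cmd.Start(); err != nil {
        return err
    }

    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()

    select {
    case err := <-done:
        return err
    case <-time.After(timeout):
        // A negative PID signals the entire process group.
        _ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
        return <-done
    }
}

func main() {
    _ = runWithGroupKill(30*time.Second, "chromium", "--headless", "about:blank")
}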

@gulien
Collaborator

gulien commented Aug 5, 2023

@petternordlanderhelo

  1. It appears you have a significant number of zombie processes (refer to this comment - they're indicated with Z). It's peculiar since tini is expected to reap them.
  2. Regarding this comment - are these also zombie processes?

@JulienWeintraub

  1. Despite the OOM and a surge in zombie processes, I'm puzzled as to why tini couldn't reap them.
  2. Presently, you're not seeing zombie processes, just an accumulation of Chromium processes.

In both scenarios, chrome_crashpad and cat processes are evident.

I'm kinda lost.

@gulien
Collaborator

gulien commented Aug 5, 2023

@makarychev13

Is there evidence of memory leaks, zombie processes, or lingering processes on your end?
Additionally, are there any specific issues or nuances when using Kubernetes?

@gulien
Collaborator

gulien commented Aug 10, 2023

OK, so a quick follow-up:

The recommended approach is now to use a single browser instance and open a new tab per conversion. This comes with its own limitations, such as no more than 6 concurrent conversions. It also means a substantial rewrite of the Chromium module.
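
Roughly, the idea with chromedp is one long-lived browser context plus a derived context (i.e. a new tab) per conversion, cancelled when the conversion finishes. A sketch only, not the final implementation:

package main

import (
    "context"
    "os"

    "github.com/chromedp/cdproto/page"
    "github.com/chromedp/chromedp"
)

func main() {
    // One allocator and one long-lived browser for the whole process lifetime.
    allocCtx, cancelAlloc := chromedp.NewExecAllocator(context.Background(),
        append(chromedp.DefaultExecAllocatorOptions[:], chromedp.NoSandbox)...)
    defer cancelAlloc()

    browserCtx, cancelBrowser := chromedp.NewContext(allocCtx)
    defer cancelBrowser()

    // Start the browser once, before serving any conversion.
    if err := chromedp.Run(browserCtx); err != nil {
        panic(err)
    }

    // Each conversion runs in its own tab derived from the browser context.
    convert := func(url string) ([]byte, error) {
        tabCtx, cancelTab := chromedp.NewContext(browserCtx)
        defer cancelTab() // closes the tab, not the browser

        var pdf []byte
        err := chromedp.Run(tabCtx,
            chromedp.Navigate(url),
            chromedp.ActionFunc(func(ctx context.Context) error {
                var err error
                pdf, _, err = page.PrintToPDF().Do(ctx)
                return err
            }),
        )
        return pdf, err
    }

    if pdf, err := convert("https://example.com"); err == nil {
        _ = os.WriteFile("out.pdf", pdf, 0o644)
    }
}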

@gulien removed the maybe label on Aug 10, 2023
@gulien
Collaborator

gulien commented Aug 11, 2023

@petternordlanderhelo @JulienWeintraub

Do you have 503 responses from Gotenberg? I wonder if it could be related to the hanging processes.

@JulienWeintraub

JulienWeintraub commented Aug 14, 2023

Hello @gulien !

I saw 28 503 responses from Gotenberg over a 2-week period...
[503 responses graph]

Not sure it's very relevant, but do you think such a low number could nevertheless be part of our root cause? 🤔

In the meantime, I temporarily increased the memory allocation and the max replica count, and I can see that despite the decrease in activity, memory still grows continuously... (no redeployments thanks to the 8 x 2 GB of memory).

[memory graph]

@gulien
Collaborator

gulien commented Aug 14, 2023

I had a tiny hope that it might be related to the "abrupt" closing of browser instances, but, given the very few cases, that's clearly not the root cause.

🫠

@user7788

user7788 commented Sep 16, 2023

I see these errors in the Cloud Run logs (7.9.1 on Cloud Run) after each Chrome conversion; not sure if it's related, as the conversions are successful:
{
insertId: "65050e11000e41b528a15c40"
labels: {1}
logName: "projects/project/logs/run.googleapis.com%2Fvarlog%2Fsystem"
receiveTimestamp: "2023-09-16T02:08:17.939143133Z"
resource: {2}
severity: "ERROR"
textPayload: "Uncaught signal: 13, pid=35008, tid=35008, fault_addr=0."
timestamp: "2023-09-16T02:08:17.934289501Z"
}

@gulien mentioned this issue on Oct 30, 2023