
Error 502 after gunicorn worker auto-restart #14691

Closed
v0tti opened this issue Jan 4, 2024 · 18 comments · Fixed by #14924
Assignees
Labels
severity: low (does not significantly disrupt application functionality, or a workaround is available)
status: accepted (this issue has been accepted for implementation)
type: bug (a confirmed report of unexpected behavior in the application)

Comments

@v0tti
Contributor

v0tti commented Jan 4, 2024

Deployment Type

Self-hosted

NetBox Version

v3.7.0

Python Version

3.9

Steps to Reproduce

This is a known issue with gunicorn that was introduced with the update of the gunicorn version in NetBox v3.5.7. Steps to reproduce can be found at benoitc/gunicorn#3038, where some NetBox users have also been reporting these errors since v3.5.7.

Expected Behavior

I do not expect to get an error 502 for common requests without an actual fault.

Observed Behavior

I get an error 502 when multiple sequential requests are sent to gunicorn with the default gunicorn version and configuration.

As there is not much happening with the gunicorn issue and pull request, I would like to suggest downgrading gunicorn to version 20.1.0.

@v0tti v0tti added the type: bug A confirmed report of unexpected behavior in the application label Jan 4, 2024
@jeremystretch
Member

jeremystretch commented Jan 4, 2024

There are a few points to consider here:

  1. AFAICT this is a bug entirely contained in gunicorn itself, and not something that we can fix with a change to NetBox code.
  2. NetBox currently prescribes the installation of gunicorn v21.2.0, which is marked as the current stable release.
  3. The proposed previous release (20.1.0) is more than two years older than the current release, and would give up significant features and bug fixes, in addition (apparently) to Python 3.11 support.
  4. The current gunicorn version has been in place for roughly six months, and this is the first time the issue has surfaced as a bug. (I'll readily admit it's likely been discussed elsewhere by NetBox users and just not been on my radar.)
  5. Users are free to install whichever version of gunicorn they want, regardless of what NetBox prescribes.
  6. Downgrading gunicorn to an older release might break things (like Python 3.11 support).

All this is to say I'm not sure where I stand on the proposal. I think my biggest issue with it is that forcing a downgrade of gunicorn on the next NetBox release could potentially cause more problems for some users than the bug at issue here. This puts us in a difficult position.

@jeremystretch jeremystretch added the status: under review Further discussion is needed to determine this issue's scope and/or implementation label Jan 4, 2024
@v0tti
Contributor Author

v0tti commented Jan 4, 2024

I fully understand that downgrading may not be an option, especially if we lose Python 3.11 support.
Maybe the existence of this problem could be mentioned in the docs?
I hope gunicorn will fix this bug soon, and if so, I think the dependency in NetBox should be updated.

@jeremystretch jeremystretch pinned this issue Jan 4, 2024
@jeremystretch
Member

I suppose it's worth explicitly calling out the workaround for anyone who happens upon this issue: Gunicorn can be downgraded to v20.1.0 for an existing NetBox deployment using pip.

$ pip install gunicorn==20.1.0
Collecting gunicorn==20.1.0
  Using cached gunicorn-20.1.0-py3-none-any.whl (79 kB)
Requirement already satisfied: setuptools>=3.0 in ./venv/lib/python3.10/site-packages (from gunicorn==20.1.0) (59.6.0)
Installing collected packages: gunicorn
  Attempting uninstall: gunicorn
    Found existing installation: gunicorn 21.2.0
    Uninstalling gunicorn-21.2.0:
      Successfully uninstalled gunicorn-21.2.0
Successfully installed gunicorn-20.1.0

@markkuleinio
Contributor

Debian 12 (on which I plan to run all my NetBox 3.7.x setups) has Python 3.11, so I'm really interested in this issue. Currently trying to find out what the "Python 3.11 support" mentioned in Gunicorn 21.x actually means.

@jeremystretch
Member

FWIW I don't expect that earlier versions of gunicorn won't work on Python 3.11, just that they aren't specifically supported (or tested).

@nomad-cyanide

I suppose it's worth explicitly calling out the workaround for anyone who happens upon this issue: Gunicorn can be downgraded to v20.1.0 for an existing NetBox deployment using pip.

$ pip install gunicorn==20.1.0

Would this work-around be a way forward for me, in this problem I have described here: #14730 ?

@v0tti
Contributor Author

v0tti commented Jan 8, 2024

Would this work-around be a way forward for me, in this problem I have described here: #14730 ?

Maybe. You should look into your gunicorn logs in systemd and check if you see something like this there:

Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Autorestarting worker after current request.
Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Worker exiting (pid: 579373)
Aug 10 13:00:07 hostname gunicorn[648814]: [2023-08-10 13:00:07 +0000] [648814] [INFO] Booting worker with pid: 648814

@nomad-cyanide

nomad-cyanide commented Jan 8, 2024

Would this work-around be a way forward for me, in this problem I have described here: #14730 ?

Maybe. You should look into your gunicorn logs in systemd and check if you see something like

Aug 10 13:00:06 hostname gunicorn[579373]: [2023-08-10 13:00:06 +0000] [579373] [INFO] Autorestarting worker after current request.

And I do.

Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Autorestarting worker after current request.
Jan 08 14:39:25 30001vmnb02-prod gunicorn[1129991]: [2024-01-08 14:39:25 +0100] [1129991] [INFO] Worker exiting (pid: 1129991)
Jan 08 14:39:26 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:39:26 +0100] [1139845] [INFO] Booting worker with pid: 1139845
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Autorestarting worker after current request.
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1129962]: [2024-01-08 14:44:11 +0100] [1129962] [INFO] Worker exiting (pid: 1129962)
Jan 08 14:44:11 30001vmnb02-prod gunicorn[1139926]: [2024-01-08 14:44:11 +0100] [1139926] [INFO] Booting worker with pid: 1139926
Jan 08 14:46:09 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:46:09 +0100] [1139845] [INFO] Autorestarting worker after current request.
Jan 08 14:46:09 30001vmnb02-prod gunicorn[1139845]: [2024-01-08 14:46:09 +0100] [1139845] [INFO] Worker exiting (pid: 1139845)
Jan 08 14:46:10 30001vmnb02-prod gunicorn[1139967]: [2024-01-08 14:46:10 +0100] [1139967] [INFO] Booting worker with pid: 1139967
Jan 08 14:50:27 30001vmnb02-prod gunicorn[1139967]: [2024-01-08 14:50:27 +0100] [1139967] [INFO] Autorestarting worker after current request.
Jan 08 14:50:27 30001vmnb02-prod gunicorn[1139967]: [2024-01-08 14:50:27 +0100] [1139967] [INFO] Worker exiting (pid: 1139967)
Jan 08 14:50:28 30001vmnb02-prod gunicorn[1140032]: [2024-01-08 14:50:28 +0100] [1140032] [INFO] Booting worker with pid: 1140032
Jan 08 14:50:45 30001vmnb02-prod gunicorn[1129800]: [2024-01-08 14:50:45 +0100] [1129800] [INFO] Autorestarting worker after current request.
Jan 08 14:50:45 30001vmnb02-prod gunicorn[1129800]: [2024-01-08 14:50:45 +0100] [1129800] [INFO] Worker exiting (pid: 1129800)
Jan 08 14:50:46 30001vmnb02-prod gunicorn[1140042]: [2024-01-08 14:50:46 +0100] [1140042] [INFO] Booting worker with pid: 1140042
Jan 08 14:55:20 30001vmnb02-prod gunicorn[1140042]: [2024-01-08 14:55:20 +0100] [1140042] [INFO] Autorestarting worker after current request.
Jan 08 14:55:20 30001vmnb02-prod gunicorn[1140042]: [2024-01-08 14:55:20 +0100] [1140042] [INFO] Worker exiting (pid: 1140042)
Jan 08 14:55:21 30001vmnb02-prod gunicorn[1140117]: [2024-01-08 14:55:21 +0100] [1140117] [INFO] Booting worker with pid: 1140117
Jan 08 14:56:39 30001vmnb02-prod gunicorn[1107507]: [2024-01-08 14:56:39 +0100] [1107507] [INFO] Autorestarting worker after current request.
Jan 08 14:56:39 30001vmnb02-prod gunicorn[1107507]: [2024-01-08 14:56:39 +0100] [1107507] [INFO] Worker exiting (pid: 1107507)
Jan 08 14:56:40 30001vmnb02-prod gunicorn[1140142]: [2024-01-08 14:56:40 +0100] [1140142] [INFO] Booting worker with pid: 1140142
Jan 08 15:01:20 30001vmnb02-prod gunicorn[1140142]: [2024-01-08 15:01:20 +0100] [1140142] [INFO] Autorestarting worker after current request.
Jan 08 15:01:21 30001vmnb02-prod gunicorn[1140142]: [2024-01-08 15:01:21 +0100] [1140142] [INFO] Worker exiting (pid: 1140142)
Jan 08 15:01:21 30001vmnb02-prod gunicorn[1140221]: [2024-01-08 15:01:21 +0100] [1140221] [INFO] Booting worker with pid: 1140221

If I'm honest, gunicorn 21.2.0 doesn't really seem production-ready if this is the behavior we can expect.

@markkuleinio
Contributor

Note that gunicorn's autorestart itself is expected and desired (as investigated in earlier NetBox issues, search here for max_requests), but it behaves differently in gunicorn 21.2.0, causing resets in connections.

In any case, supporting the issue at benoitc/gunicorn#3038 is probably the way to affect gunicorn.

@v0tti
Contributor Author

v0tti commented Jan 9, 2024

Note that gunicorn's autorestart itself is expected and desired (as investigated in earlier NetBox issues, search here for max_requests), but it behaves differently in gunicorn 21.2.0, causing resets in connections.

Sure, I should have added that these autorestarts should correlate with the 502s that are observed.

@nomad-cyanide

I have been testing a few things, and these are my findings.
It doesn't help if I decrease the max_requests setting; it just fails sooner.
It doesn't help if I increase the max_requests setting either; it still fails at some point.
When it fails, a log entry about restarting shows up simultaneously:

Example:

Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Autorestarting worker after current request.
Jan 09 13:20:07 30001vmnb02-test gunicorn[106082]: [2024-01-09 13:20:07 +0100] [106082] [INFO] Worker exiting (pid: 106082)
Jan 09 13:20:08 30001vmnb02-test gunicorn[106256]: [2024-01-09 13:20:08 +0100] [106256] [INFO] Booting worker with pid: 106256

If I set max_requests to 0, thus disabling it, my scripts work without error. But is this preferable to having gunicorn processes restart regularly? I suppose it would start consuming memory, if it has memory-leak errors, that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services once a day would do the trick?

@nomad-cyanide

If I set max_requests to 0, thus disabling it, my scripts work without error. But is this preferable to having gunicorn processes restart regularly? I suppose it would start consuming memory, if it has memory-leak errors, that is.

Perhaps a scheduled restart of the Netbox and netbox-rq services once a day would do the trick?

I have come to the conclusion that, rather than downgrade gunicorn and maybe lose some necessary features, I will go ahead with max_requests set to 0, and if memory usage becomes an issue on the server I will set up a scheduled job that restarts the worker processes with this command:

ps -aux | grep venv/bin/gunicorn | grep Ss | awk '{ system("kill -HUP " $2 )}'
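The max_requests change discussed above goes into gunicorn's Python config file. A minimal sketch follows; max_requests and max_requests_jitter are standard gunicorn settings, but the other values are only my recollection of NetBox's sample gunicorn.py, so verify them against your own copy.

```python
# gunicorn.py - sketch of the workaround discussed in this thread.
# Setting max_requests = 0 disables worker auto-restart entirely,
# trading the 502s for potentially unbounded worker memory growth.

bind = '127.0.0.1:8001'   # address the reverse proxy forwards to (assumed)
workers = 5               # tune to your CPU count
threads = 3
timeout = 120
max_requests = 0          # 0 = never auto-restart workers
max_requests_jitter = 0   # irrelevant once auto-restart is disabled
```

Gunicorn picks this file up via its `-c` option, which NetBox's systemd service already passes.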

@MANT5149

We ran into this issue as well. We have several scripts using the NetBox API, and they began having intermittent issues after the gunicorn upgrade to 21.2.0. We've been downgrading to 20.1.0 which works flawlessly. Recently did an in-place upgrade from Debian 11 to 12 and everything is still working smooth with 20.1.0 and NB 3.7.0.
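Since the 502s are transient (they coincide with a single worker restart), API scripts can also work around them client-side by retrying. A stdlib-only sketch, assuming an idempotent GET; the URL, token handling, and retry parameters are illustrative, not part of NetBox or this thread:

```python
# Sketch: retry a NetBox API GET on HTTP 502 with exponential backoff,
# as a client-side mitigation for the worker auto-restart bug.
import time
import urllib.error
import urllib.request

def get_with_retry(url, token, retries=3, backoff=0.5):
    """GET `url`, retrying on 502 up to `retries` times."""
    req = urllib.request.Request(url, headers={"Authorization": f"Token {token}"})
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            # A 502 here is typically the reverse proxy reporting a dropped
            # upstream connection during a gunicorn worker auto-restart.
            if e.code != 502 or attempt == retries:
                raise
            time.sleep(backoff * 2 ** attempt)
```

Only retry requests that are safe to repeat; blindly retrying POSTs could duplicate objects.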

@markkuleinio
Contributor

I'd appreciate observations:

  • commenting out max_requests in gunicorn.py (= no restarts) + restarting netbox+netbox-rq
  • observing the memory consumption of the gunicorn processes over time with NetBox 3.7.0.

Unfortunately my current prod NetBoxes are 3.3.x so it's no use doing this there, and my 3.7.0 test boxes don't get nearly enough legitimate use to tell whether the gunicorn instances are still leaking memory or not.

@tyler-8
Contributor

tyler-8 commented Jan 11, 2024

I'll just add that I've been running NetBox using uwsgi for years now and it's worked without any noticeable issue.

@markkuleinio
Contributor

I see NetBox maintainers also mentioned uWSGI in their meeting: #14777

Do you want to share your uWSGI configs and relevant systemd (etc.) files? It's a totally new area for me, even though I know how the gunicorn+systemd combo works.

@markkuleinio
Contributor

I have opened discussion #14780 about uWSGI configuration.

@jeremystretch
Member

We've added a note in the installation docs for now. Will keep an eye on this; hoping for a new gunicorn release soon. 🤞

@jeremystretch jeremystretch added status: accepted This issue has been accepted for implementation severity: low Does not significantly disrupt application functionality, or a workaround is available and removed status: under review Further discussion is needed to determine this issue's scope and/or implementation labels Jan 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 25, 2024