Skip to content
This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

[Entrypoint] Exit Code 2 even when it shouldn't #4664

Open
zenchild opened this issue May 4, 2023 · 2 comments
Open

[Entrypoint] Exit Code 2 even when it shouldn't #4664

zenchild opened this issue May 4, 2023 · 2 comments
Labels
bug Something isn't working entrypoint Things related to the entrypoint binary run inside a deployment jira Will add an Issue to Jira

Comments

@zenchild
Copy link

zenchild commented May 4, 2023

Describe the bug
Note: There is also a discussion thread about this topic.

We are noticing an issue with the waypoint-entrypoint where it always exits with an exit code 2 and does not seem to be propagating the SIGTERM to the process that it is executing. This surfaces in Nomad with the following error when stopping an allocation:

Terminated Exit Code: 2, Exit Message: “Docker container exited with non-zero exit code: 2”

Since the SIGTERM isn't propagating, the process is being hard-killed, which isn't very conducive to clean migrations to other Nomad Nodes, etc.

Steps to Reproduce

I’ve been playing a bit with this on my local machine and have managed to reproduce this with just docker. Here is an example of running without the waypoint-entrypoint and the exit is graceful:

docker run -it --rm --name demo_django_web ${DEMO_IMAGE} /app/scripts/web-manual.sh
[2023-04-10 14:46:42 +0000] [1] [INFO] Starting gunicorn 20.1.0
[2023-04-10 14:46:42 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2023-04-10 14:46:42 +0000] [1] [INFO] Using worker: gthread
[2023-04-10 14:46:42 +0000] [7] [INFO] Booting worker with pid: 7
[2023-04-10 14:46:43 +0000] [8] [INFO] Booting worker with pid: 8
[2023-04-10 14:46:46 +0000] [1] [INFO] Handling signal: term
{"name": "gunicorn.error", "processName": "MainProcess", "threadName": "MainThread", "levelname": "INFO", "message": "Worker exiting (pid: 7)", "timestamp": "2023-04-10T14:46:47.182976+00:00"}
{"name": "gunicorn.error", "processName": "MainProcess", "threadName": "MainThread", "levelname": "INFO", "message": "Worker exiting (pid: 7)", "timestamp": "2023-04-10T14:46:47.182976+00:00"}
{"name": "gunicorn.error", "processName": "MainProcess", "threadName": "MainThread", "levelname": "INFO", "message": "Worker exiting (pid: 8)", "timestamp": "2023-04-10T14:46:47.213390+00:00"}
{"name": "gunicorn.error", "processName": "MainProcess", "threadName": "MainThread", "levelname": "INFO", "message": "Worker exiting (pid: 8)", "timestamp": "2023-04-10T14:46:47.213390+00:00"}
[2023-04-10 14:46:47 +0000] [1] [INFO] Shutting down: Master
» echo $?
0

As you can see, the gunicorn process recieves the SIGTERM, exits gracefully and the exit code is 0.

If I run that exact same process, but with a waypoint-entrypoint, I see the following:

» docker run -it --rm --name demo_django_web ${DEMO_IMAGE} /app/waypoint-entrypoint /app/scripts/web-manual.sh
2023-04-10T14:47:43.359Z [INFO]  entrypoint: entrypoint starting: deployment_id="" instance_id=01GXNT65ZZAZRWHX65ZQR9YCB2 args=["/app/scripts/web-manual.sh"]
2023-04-10T14:47:43.359Z [INFO]  entrypoint: entrypoint version: full_string=v0.11.0 version=v0.11.0 prerelease="" metadata="" revision=""
2023-04-10T14:47:43.359Z [DEBUG] entrypoint.child: waiting for stateChildReady to flip to true
2023-04-10T14:47:43.359Z [DEBUG] entrypoint.child: starting child command watch loop
2023-04-10T14:47:43.359Z [DEBUG] entrypoint.child: child command received
2023-04-10T14:47:43.359Z [INFO]  entrypoint.child: starting child process: args=["/app/scripts/web-manual.sh"] cmd=/app/scripts/web-manual.sh
[2023-04-10 14:47:43 +0000] [16] [INFO] Starting gunicorn 20.1.0
[2023-04-10 14:47:43 +0000] [16] [INFO] Listening at: http://0.0.0.0:8000 (16)
[2023-04-10 14:47:43 +0000] [16] [INFO] Using worker: gthread
[2023-04-10 14:47:43 +0000] [17] [INFO] Booting worker with pid: 17
[2023-04-10 14:47:43 +0000] [18] [INFO] Booting worker with pid: 18
» echo $?                                                                                                           2 ↵
2

This time, the gunicorn process does not receive the SIGTERM and the exit code for the container is 2. Interestingly, the container shuts down almost immediately so it’s not waiting for the web process to finish.

For the same app running in Nomad, the resource event log looks like this:

Time Type Description
May 04, '23 09:49:39 -0400 Killed Task successfully killed
May 04, '23 09:49:39 -0400 Terminated Exit Code: 2, Exit Message: "Docker container exited with non-zero exit code: 2"
May 04, '23 09:49:39 -0400 Killing Sent interrupt. Waiting 5s before force killing
May 03, '23 18:10:46 -0400 Started Task started by client

In the app log, there is nothing output. Nothing from the entrypoint and nothing from the app itself.

Expected behavior
I would expect the entrypoint to propagate the SIGTERM so the application can exit gracefully and the entrypoint should also exit gracefully.

Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:

  • Waypoint CLI Version: 0.11.0
  • Waypoint Server Platform and Version: Seems to be consistent for docker and Nomad.

We are seeing this across multiple apps built multiple ways. Some with pack, some with docker files. Some python, some ruby.

Any help in resolving this is greatly appreciated. Let me know if there is any additional context we can provide.

@zenchild zenchild added the new label May 4, 2023
@paladin-devops paladin-devops added bug Something isn't working entrypoint Things related to the entrypoint binary run inside a deployment jira Will add an Issue to Jira and removed new labels May 17, 2023
@paladin-devops
Copy link
Contributor

Hey @zenchild, thanks for opening this issue! We will try to reproduce this, it looks like the SIGTERM being sent to the entrypoint isn't being handled cleanly.

@AAverin
Copy link

AAverin commented Jul 23, 2023

I am experiencing something similar but just running prometheus with nomad-pack.
What would be the good way to debug the cause of Error: 2? Container is killed so there are no logs to get via docker logs <container_id>.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working entrypoint Things related to the entrypoint binary run inside a deployment jira Will add an Issue to Jira
Projects
None yet
Development

No branches or pull requests

3 participants