This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Static runner hanging on aws ecs #4460

Open
izaaklauer opened this issue Jan 27, 2023 · 1 comment · May be fixed by #4854
Labels
bug (Something isn't working), intermittent, jira (Will add an Issue to Jira), plugin/ecs

Comments

izaaklauer commented Jan 27, 2023

Describe the bug
I have a runner installed on AWS ECS using waypoint runner install, pointing to the production HCP Waypoint server.

Currently, every remote operation behaves like this:

$ wp deploy

» Deploying acmeapp1...

» Operation is queued waiting for job "01GQTP02MT4PDYE8SCSFSP9CHC". Waiting for runner assignment...
  If you interrupt this command, the job will still run in the background.

According to waypoint job list, we're waiting for the static runner to take the StartTask job.

Here are the runner's most recent logs, according to CloudWatch:



2023-01-25T18:16:57.441-05:00  2023-01-25T23:16:57.441Z [INFO] waypoint.runner.agent.runner: waiting for job assignment
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-26T18:11:00.128-05:00  2023-01-26T23:11:00.128Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect

It's currently 2023-01-27, so it looks like the HCP server went down briefly at 2023-01-26T18:11:00.128-05:00, and the runner has been stuck ever since.

I've tcping'd the runner's health check port 1234, and it's still open.
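For anyone repeating this check without tcping, here is a minimal Go sketch of the same kind of probe. The helper names `portOpen` and `probeLocalListener` are made up for illustration and are not part of Waypoint. The demo probes a throwaway loopback listener rather than the real runner; for the runner you would pass its address with port 1234.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// portOpen reports whether a TCP connection to addr succeeds within timeout.
func portOpen(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// probeLocalListener (hypothetical demo helper) opens a throwaway loopback
// listener, probes it, closes it, and probes it again.
// Note that nothing ever calls Accept on the listener: the probe still
// succeeds because the kernel completes the TCP handshake on its own.
func probeLocalListener() (whileListening, afterClose bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	addr := ln.Addr().String()
	whileListening = portOpen(addr, time.Second)
	ln.Close()
	afterClose = portOpen(addr, time.Second)
	return whileListening, afterClose, nil
}

func main() {
	up, down, err := probeLocalListener()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("open while listening: %v, open after close: %v\n", up, down)
}
```

The takeaway is that an open port only shows the process still holds a listening socket. It says nothing about whether the job accept loop is making progress, which is consistent with what we're seeing here.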

I'd like to get in there and take a thread dump, but it looks like enabling exec on AWS ECS is non-trivial and needs to be set up before the task is launched: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html
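For reference, a Go process can produce the equivalent of a thread dump by itself through the standard runtime/pprof package. The sketch below is only an illustration of that mechanism: `goroutineDump` and `dumpOnSignal` are hypothetical helpers, SIGUSR1 is an arbitrary choice of trigger, and nothing here is claimed to exist in the Waypoint runner today.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

// goroutineDump returns the stack of every goroutine in the process.
// debug=2 uses the same format the runtime prints on an unrecovered panic.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// dumpOnSignal (hypothetical helper) writes a goroutine dump to stderr
// every time the process receives sig.
func dumpOnSignal(sig os.Signal) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, sig)
	go func() {
		for range ch {
			if s, err := goroutineDump(); err == nil {
				fmt.Fprintln(os.Stderr, s)
			}
		}
	}()
}

func main() {
	// SIGUSR1 is an assumption; any signal the process does not already use works.
	dumpOnSignal(syscall.SIGUSR1)

	s, err := goroutineDump()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("captured %d bytes of goroutine stacks\n", len(s))
}
```

Delivering a signal to an ECS task has the same access problem as exec, so in practice the dump would probably need a different trigger, such as a timer that fires when no job has been accepted for a long time.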

Based on the logs, I bet it's hanging somewhere in here: https://github.com/hashicorp/waypoint/blob/main/internal/runner/accept.go#L191-L267

My money is on here:

streamCtxLock.Lock()

Or here:

if r.waitStateGreater(&r.stateConfig, stateGen) {

If we don't see the hang from a code walkthrough, we should at least add some more logging before each of those points.
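As a rough idea of what that logging could look like, here is a minimal, self-contained Go sketch. `lockWithLog`, `memLog`, and `demo` are hypothetical names for illustration; only the `streamCtxLock` name comes from accept.go, and the real code would use the runner's own logger rather than an in-memory one. The helper logs before and after the acquire, and also logs a warning if the acquire is still blocked after a threshold.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// memLog is a tiny goroutine-safe in-memory logger used only for this demo.
type memLog struct {
	mu    sync.Mutex
	lines []string
}

func (l *memLog) Logf(format string, args ...interface{}) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.lines = append(l.lines, fmt.Sprintf(format, args...))
}

// Lines returns a copy of everything logged so far.
func (l *memLog) Lines() []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]string(nil), l.lines...)
}

// lockWithLog (hypothetical helper) logs before and after acquiring mu, and
// logs once more if the acquire is still blocked after warnAfter.
func lockWithLog(mu *sync.Mutex, name string, warnAfter time.Duration, logf func(string, ...interface{})) {
	logf("acquiring %s", name)
	timer := time.AfterFunc(warnAfter, func() {
		logf("still waiting for %s after %s", name, warnAfter)
	})
	mu.Lock()
	timer.Stop()
	logf("acquired %s", name)
}

// demo simulates contention: another goroutine holds the lock for 200ms
// while the caller waits with a 20ms warning threshold.
func demo() []string {
	var streamCtxLock sync.Mutex
	l := &memLog{}

	streamCtxLock.Lock()
	go func() {
		time.Sleep(200 * time.Millisecond)
		streamCtxLock.Unlock()
	}()

	lockWithLog(&streamCtxLock, "streamCtxLock", 20*time.Millisecond, l.Logf)
	streamCtxLock.Unlock()
	return l.Lines()
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

If the runner were stuck on the lock, the last log line would be the "still waiting" warning with no "acquired" after it, which would confirm or rule out the first suspect. The same before/after pattern could wrap the waitStateGreater call.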

Workaround

Stopping the runner task and letting ECS spin up a new one fixed the problem. The new runner was able to accept jobs.

NOTE: if you don't want the runner to start executing the full backlog of jobs that built up during the hang, cancel all Queued jobs with waypoint job cancel first.

Steps to Reproduce

  • Run a static runner on ECS
  • Wait for an eventual hang

Expected behavior
The Waypoint runner should not hang; it should reconnect and resume accepting jobs once the server is back.

Waypoint Platform Versions
Additional version and platform information to help triage the issue if
applicable:

  • Waypoint CLI Version: 0.10.5
  • Waypoint Server Platform and Version: (like docker, nomad, kubernetes): HCP

Additional context
If anyone else sees this, add a 👍

@izaaklauer added the new, jira, plugin/ecs, bug, and intermittent labels and removed the new label on Jan 27, 2023

cicoyle commented Feb 2, 2023

Saw this again in ECS:



2023-01-25T16:46:23.421-06:00  2023-01-25T22:46:23.421Z [DEBUG] waypoint.runner.agent.runner: sending job completion: job_id=01GQNHP22S644CP4GGJTGAAC52 job_op=*gen.Job_StopTask
2023-01-25T16:46:23.450-06:00  2023-01-25T22:46:23.450Z [DEBUG] waypoint.runner.agent.runner: opening job stream: retry=false
2023-01-25T16:46:23.450-06:00  2023-01-25T22:46:23.450Z [INFO] waypoint.runner.agent.runner: waiting for job assignment
2023-01-27T01:43:00.039-06:00  2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-27T01:43:00.039-06:00  2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-27T01:43:00.039-06:00  2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect
2023-01-27T01:43:00.039-06:00  2023-01-27T07:43:00.038Z [WARN] waypoint.runner.agent.runner: server down before accepting a job, will reconnect


It also looks like there is a zombie ODR (on-demand runner) task.
