Skip to content
This repository has been archived by the owner on Jan 8, 2024. It is now read-only.

Not interpreting server NotFound error as server down #4854

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

izaaklauer
Copy link
Contributor

@izaaklauer izaaklauer commented Jul 21, 2023

Fixes #4460

How does the runner get stuck?

Here's the failure scenario:

1: The server and runner are both connected and fine.

2: A new job arrives, and the server decides to assign it to our runner.

3: Our first problem - the server has an issue assigning the job to a runner, and it's a NotFound-type error. We've seen it happen here, while getting config-sourcers.

return hcerr.Externalize(log, err, "failed to get the configuration for a dynamic source plugin to send with job assignment")
. There's likely another server bug at play here making it semi-regularily return NotFound errors, but fundamentally, returning a NotFound error here is something the server is allowed to do.

4: The runner receives this error, and because it's a NotFound error, it calls goto RESTART_JOB_STREAM (here), which we should only do if we've actually disconnected from the server. This is the bug. Relevant chunk:

resp, err := client.Recv()
if err != nil {
if atomic.LoadInt32(&canceled) > 0 ||
status.Code(err) == codes.Unavailable ||
status.Code(err) == codes.NotFound {
goto RESTART_JOB_STREAM
}
return err
}

5: The runner follows the goto, and then blocks here forever:

if r.waitStateGreater(&r.stateConfig, stateGen) {

Why is the runner blocking there, you ask? Here's my best understanding:

If this were a true disconnect, there is a second goroutine that would also be attempting to reconnect - the recvConfig loop (here). It has sophisticated reconnect logic, and if it were to also have seen the disconnect, it would reconnect, and then increment the runner-wide state version here, which would unblock the accept goroutine. @martisah and I verified this through observation.

That isn't happening here, because the recvConfig goroutine hasn't seen an error case, so it never has a reason to modify the runner state.

How does this fix it?

The primary cause is that the runner's accept goroutine is interpreting any NotFound error from the server as a disconnect event, which it expects all other goroutines will see. That's a bad assumption.

This PR modifies the accept goroutine to exit if it sees a NotFound error, rather than attempt a server reconnect. This raised a new bug: the top-level AcceptMany function was interpreting the server sending NotFound on the runner job stream as a deregistration event:

case codes.NotFound:
// This means the runner was deregistered and we must exit.
// This won't be fixed unless the runner is closed and restarted.
r.logger.Error("runner unexpectedly deregistered, exiting")
return

That's also a bad assumption. The server in fact sends FailedPrecondition (here).

This fixes that second bug by again interpreting NotFound as a generic error. The AcceptMany NotFound check was introduced to prevent a runner DDOS event (here), and this continues to protect against that by moving the sleep before reconnect to the default case.

tl;dr: with this change, if the server responds to the runner with a NotFound error, the Accept loop will log the error, exit, and then be restarted successfully by AcceptMany, ready to accept another job.

How can I reproduce this bug?

  • Install a runner in ECS
  • Create a waypoint project with ODR and configsourcers, pointed to the HCP waypoint server
  • Set the static runner's concurrency flag to 1
  • Somehow get the server to respond with NotFound on the runner job stream. It's not yet clear the exact steps to reproduce this, but we got there by running some number of waypoint up's, destroying the project, recreating the project, and repeating.
  • At this point, the static runner should log server down before accepting a job, will reconnect, and then refuse to accept additional jobs. You will notice continued logs from waypoint.runner.agent.runner.config.watcher, but none from the accept loop.

Future considerations

  • We intend to push some of the debugging logging we introduced to find this issue
  • We should investigate why the server is responding NotFound on the job stream
  • We should verify further that runner deregistration still works. In my testing, I observed that with this change, the runner handles one additional job before exiting, but that may be preexisting behavior.

Comment on lines +12 to +15

projConfig "github.com/hashicorp/waypoint/internal/config"
"github.com/hashicorp/waypoint/internal/core"
pb "github.com/hashicorp/waypoint/pkg/server/gen"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated auto-formatting fix

Co-authored-by: Martina Santangelo <martina.santangelo@hashicorp.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Static runner hanging on aws ecs
2 participants