Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A worker shouldn't stop after getting a 503 #5666

Closed
Eijebong opened this issue Sep 8, 2022 · 5 comments · Fixed by #5667
Closed

A worker shouldn't stop after getting a 503 #5666

Eijebong opened this issue Sep 8, 2022 · 5 comments · Fixed by #5667
Assignees
Labels
bug Something isn't working

Comments

@Eijebong
Copy link
Contributor

Eijebong commented Sep 8, 2022

Right now if a worker gets a 503, it'll panic, exit and never come back.

2022/09/08 09:09:51Z WORKER_METRICS {"eventType":"workerReady","instanceType":"","region":"","timestamp":1662628191,"worker":"generic-worker","workerId":"windows-3","workerPoolId":"divvun/windows"}
2022/09/08 09:10:02Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": net/http: TLS handshake timeout
2022/09/08 09:10:12Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": net/http: TLS handshake timeout
2022/09/08 09:10:23Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": net/http: TLS handshake timeout
2022/09/08 09:10:25Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:10:26Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:10:29Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:10:33Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49715->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:10:36Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49717->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:10:42Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:10:50Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:11:05Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49735->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:11:34Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49737->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:12:31Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:13:14Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49740->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:14:07Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49741->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:14:56Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49742->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:15:55Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49750->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:16:42Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49751->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:17:29Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49752->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:18:40Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49753->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:19:23Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49771->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:20:05Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49773->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:20:57Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49776->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:22:01Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49777->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:23:23Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": EOF
2022/09/08 09:24:11Z Error: Post "https://divvun-tc.thetc.se/api/queue/v1/claim-work/divvun%2Fwindows": read tcp 192.168.122.185:49779->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:24:59Z Saving file file-caches.json (absolute path: c:\generic-worker\file-caches.json)
2022/09/08 09:24:59Z Saving file directory-caches.json (absolute path: c:\generic-worker\directory-caches.json)
2022/09/08 09:24:59Z Running command: icacls 'file-caches.json' /grant:r 'Administrators:(GA)' /inheritance:r
2022/09/08 09:24:59Z Running command: icacls 'directory-caches.json' /grant:r 'Administrators:(GA)' /inheritance:r
2022/09/08 09:24:59Z Running command: icacls 'tasks-resolved-count.txt' /grant:r 'Administrators:(GA)' /inheritance:r
2022/09/08 09:24:59Z goroutine 1 [running]:
runtime/debug.Stack(0x0, 0x0, 0x2a58d4)
	/task_162931984430872/go/go/src/runtime/debug/stack.go:24 +0xa5
main.HandleCrash(0x80a9e0, 0xc959d0)
	/task_162931984430872/taskcluster/workers/generic-worker/main.go:319 +0x2d
main.RunWorker.func1(0xc000727e60)
	/task_162931984430872/taskcluster/workers/generic-worker/main.go:341 +0x5c
panic(0x80a9e0, 0xc959d0)
	/task_162931984430872/go/go/src/runtime/panic.go:971 +0x49a
github.com/taskcluster/taskcluster/v44/clients/client-go.(*Client).APICall(0xc000727680, 0x7cfba0, 0xc0003d4690, 0x8a240d, 0x4, 0xc000086660, 0x1c, 0x7cfbe0, 0xc000114348, 0x0, ...)
	/task_162931984430872/taskcluster/clients/client-go/http.go:255 +0x2b3
github.com/taskcluster/taskcluster/v44/clients/client-go/tcqueue.(*Queue).ClaimWork(0xc0001f27e0, 0xc00028e800, 0xe, 0xc0003d4690, 0x2, 0xc00028e800, 0xe)
	/task_162931984430872/taskcluster/clients/client-go/tcqueue/tcqueue.go:447 +0x185
main.ClaimWork(0x0)
	/task_162931984430872/taskcluster/workers/generic-worker/main.go:540 +0x23e
main.RunWorker(0x0)
	/task_162931984430872/taskcluster/workers/generic-worker/main.go:417 +0x610
main.main()
	/task_162931984430872/taskcluster/workers/generic-worker/main.go:158 +0x739
2022/09/08 09:24:59Z  *********** PANIC occurred! *********** 
2022/09/08 09:24:59Z runtime error: invalid memory address or nil pointer dereference
2022/09/08 09:24:59Z Error: Get "https://divvun-tc.thetc.se/api/auth/v1/sentry/generic-worker/dsn": read tcp 192.168.122.185:49785->134.209.134.163:443: wsarecv: An existing connection was forcibly closed by the remote host.
2022/09/08 09:26:24Z Error: (Intermittent) HTTP response code 503
HTTP/2.0 503 Service Unavailable
Content-Length: 190
Content-Type: text/html
Date: Thu, 08 Sep 2022 09:26:24 GMT
Strict-Transport-Security: max-age=15724800; includeSubDomains

<html>
<head><title>503 Service Temporarily Unavailable</title></head>
<body>
<center><h1>503 Service Temporarily Unavailable</h1></center>
<hr><center>nginx</center>
</body>
</html>
[ ... sentry stuff ... ]
2022/09/08 09:26:48Z Exiting worker with exit code 69
2022/09/08 09:26:48Z WORKER_METRICS {"eventType":"instanceShutdown","instanceType":"","region":"","timestamp":1662629208,"worker":"generic-worker","workerId":"windows-3","workerPoolId":"divvun/windows"}

The queue service had a few issues coming back up after an update which might explain all those EOF lines (not related to taskcluster itself).

@Eijebong Eijebong added the bug Something isn't working label Sep 8, 2022
@petemoore
Copy link
Member

petemoore commented Sep 8, 2022

The root cause may have been the nil pointer error, which might have caused the panic. It isn't clear to me yet if the nil pointer happened as part of the panic handling, or was the trigger for it.

runtime error: invalid memory address or nil pointer dereference

@petemoore
Copy link
Member

Which release of generic-worker is this?

@Eijebong
Copy link
Contributor Author

Eijebong commented Sep 8, 2022

Unfortunately a pretty old one, 44.2.2

@petemoore
Copy link
Member

Looks like the bug lies in the go client here. The HTTPResponse appears to be nil and the code doesn't handle this case. Perhaps because the connection was terminated prematurely.

@Eijebong
Copy link
Contributor Author

Eijebong commented Sep 8, 2022

So, the http response is set here:

callSummary.HTTPResponse, callSummary.Attempts, err = backoffClient.Retry(httpCall)

which comes directly from here: https://github.com/taskcluster/httpbackoff/blob/main/httpbackoff.go#L123-L135

All error cases in the http request function passed to it (

httpCall := func() (*http.Response, error, error) {
) return nil for the response.

That isn't a problem unless the backoff gives up which seems to be the case in the default config after the back time goes over its given max time. https://github.com/cenkalti/backoff/blob/v3/exponential.go#L89

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

Successfully merging a pull request may close this issue.

2 participants