Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 prevent controller.New and manager's errChan from leaking goroutines #651

Conversation

DirectXMan12
Copy link
Contributor

@DirectXMan12 DirectXMan12 commented Oct 19, 2019

This prevents a couple of goroutine leaks related to the controller and manager logic.

The first is in controller: controller.New previously created a new workqueue, which launched goroutines expecting to be stopped with Shutdown. This is semi-unexpected in CR, since we generally state that things don't happen until Start is called (and, indeed, we don't call shutdown unless start is called). This means that calling controller.New without ever starting the controller will cause a goroutine leak. We fix this by delaying workqueue initialization (and consequently source starting) until the controller's Start is called.

The second is in the manager's handling of errors -- we used to use an error channel to report errors from internal goroutines and runnables. However, we only ever read the first error from the channel, and it was a non-buffered channel. This meant that if multiple things tried to return results on the channel, all but one would block forever. This was exacerbated by the fact that the goroutines that launched controllers tried to return all results (nil values) on the channel, not just errors. We fix this by using a mutex to save a single error, using a channel's closing to signal shutdown only, and by only ever returning non-nil errors via the error signaler. This means that the only blocking is ever on a short-duration mutex.

Fixes #649

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 19, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DirectXMan12

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 19, 2019
@DirectXMan12
Copy link
Contributor Author

cc @droot

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 19, 2019
This quashes a goroutine leak caused by calling controller.New repeatedly without
calling Start.  controller.New was creating a new workqueue, which was
starting goroutines and then expecting to be shut down (by the shutdown
method, which is only called at the end of Start).

To solve that, we move workqueue initialization to controller.Start.
This means that we also move watch starting to controller.Start, but
this seems pretty sensible anyway.
@DirectXMan12 DirectXMan12 force-pushed the bug/controller-new-spawns-goroutines branch from 55ca61c to d98ce09 Compare October 19, 2019 00:22
@DirectXMan12
Copy link
Contributor Author

FYI, there's a subtle change in behavior here. I think it's the correct one and the old one was a bug, but we used to send nil results from controllers on the error channel, meaning a returned controller is treated as an error. This should never happen except in graceful shutdown, but it's still a behavior change.

type errSignaler struct {
// errSignal indicates that an error occurred, when closed. It shouldn't
// be written to.
errSignal chan struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's chan struct{} instead of chan error?
nil can be treated as error, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is being used purely for signaling that some error has occurred, it is fine.

errSignal chan struct{}

// err is the received error
err error
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to change it to []error?
In SignalError(), we append errors.
In Error(), we aggregate errors into one error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...since mgr bails out on seeing first error, converting to slice of errors is not going to be of much help.

Copy link
Contributor

@droot droot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. One minor comment.

type errSignaler struct {
// errSignal indicates that an error occurred, when closed. It shouldn't
// be written to.
errSignal chan struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is being used purely for signaling that some error has occurred, it is fine.

errSignal chan struct{}

// err is the received error
err error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...since mgr bails out on seeing first error, converting to slice of errors is not going to be of much help.

pkg/manager/internal.go Show resolved Hide resolved
@DirectXMan12 DirectXMan12 force-pushed the bug/controller-new-spawns-goroutines branch from 49a24a1 to 5c7efdd Compare October 29, 2019 21:03
Instead of using an error channel in the manager, we just use a mutex
and store one error.  The "waiting" is done by blocking on a channel
which gets closed to signal errors.

This achives the same effect (only return one error) without having the
chance of blocking goroutines; with the old code, goroutines could block
trying to return their results (especially since a few tried to return
nil results), and since runnables can be added after start, there's no
way to appropriately size the channel to avoid this happening (plus no
point, since we only report the first error anyway).

We also only report errors when the occurred, never signaling for errors with
a nil error value.
Since we changed the behavior so that non-erroring runnable completion
doesn't stop the manager, this logs a log line to make it more obvious
that this has occurred, just in case.
@DirectXMan12 DirectXMan12 force-pushed the bug/controller-new-spawns-goroutines branch from 5c7efdd to 4d81992 Compare October 29, 2019 21:08
@droot droot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 29, 2019
@k8s-ci-robot k8s-ci-robot merged commit 40070e2 into kubernetes-sigs:master Oct 29, 2019
@DirectXMan12 DirectXMan12 deleted the bug/controller-new-spawns-goroutines branch November 11, 2019 19:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clean up controller.New workqueue goroutine leaks
4 participants