Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce goroutine leakage in test/integration/controlplane/transformation #111739

Closed
wants to merge 1 commit into from

Conversation

sanwishe
Copy link
Contributor

@sanwishe sanwishe commented Aug 7, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

fix leaking goroutines in test/integration/controlplane/transformation

Which issue(s) this PR fixes:

Fixes #111674

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 7, 2022
@k8s-ci-robot
Copy link
Contributor

Hi @sanwishe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/apiserver labels Aug 7, 2022
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 7, 2022
@enj
Copy link
Member

enj commented Aug 7, 2022

Instead of a manual stop method that is a no-op in many cases I think it would be better to have a context be used to cancel any implementations that explicitly need it.

Copy link
Contributor

@yangjunmyfm192085 yangjunmyfm192085 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 8, 2022
@@ -198,7 +198,7 @@ func EtcdMain(tests func() int) {
// like k8s.io/klog/v2.(*loggingT).flushDaemon()
// TODO(#108483): Reduce this number once we address the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToDo can be deleted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully can be deleted soon, but there are still some other leaks #108483 (comment)

@aojea
Copy link
Member

aojea commented Aug 8, 2022

Instead of a manual stop method that is a no-op in many cases I think it would be better to have a context be used to cancel any implementations that explicitly need it.

@enj this is coming from #108483, it is not about context cancellation it is about storage shutdown, it needs a function that allows to stop and destroy all resources

var once sync.Once
destroyFunc := func() {
// we know that storage destroy funcs are called multiple times (due to reuse in subresources).
// Hence, we only destroy once.
// TODO: fix duplicated storage destroy calls higher level
once.Do(func() {
transformer.Stop()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does order matters?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It matters, and i have move it after client.close()

@aojea
Copy link
Member

aojea commented Aug 8, 2022

/assign @wojtek-t

he is more familiar with this code

@sanwishe sanwishe force-pushed the goroutineleak1 branch 2 times, most recently from b857165 to 110c4bc Compare August 8, 2022 08:06
@yangjunmyfm192085
Copy link
Contributor

/retest

@enj
Copy link
Member

enj commented Aug 8, 2022

Instead of a manual stop method that is a no-op in many cases I think it would be better to have a context be used to cancel any implementations that explicitly need it.

@enj this is coming from #108483, it is not about context cancellation it is about storage shutdown, it needs a function that allows to stop and destroy all resources

I understand what this code is trying to do. Contexts can be used to control lifetime, in the same way a manual stop method can. I do not want to expand the storage transformer interface with a method that one can forget to call. Instead, any implementation that needs any form of cleanup should take a context in its constructor and automatically perform said cleanup when the context is canceled.

@enj
Copy link
Member

enj commented Aug 8, 2022

/assign

@249043822
Copy link
Member

a kms transformer which uses a grpc-service but doen't provide a mechnism to close the connection of grpc-service , so we should assure connection close while apiserver shutdown. since the lifecycle of the transformer is consistent with apiserver, does listenning a shutdown signal(or a stopCh) for grpc-service closing more suitable than a context? @aojea @enj @liggitt

@wojtek-t
Copy link
Member

wojtek-t commented Aug 9, 2022

a kms transformer which uses a grpc-service but doen't provide a mechnism to close the connection of grpc-service , so we should assure connection close while apiserver shutdown. since the lifecycle of the transformer is consistent with apiserver, does listenning a shutdown signal(or a stopCh) for grpc-service closing more suitable than a context? @aojea @enj @liggitt

Storage is being destroyed at the very end:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go#L464
We still don't have graceful shutdown of watches (which can use transformers), so we need to ensure that transformers won't get shutdown too early.

Instead, any implementation that needs any form of cleanup should take a context in its constructor and automatically perform said cleanup when the context is canceled.

@enj - I don't have very strong opinion on that and if we can make that work with contexts - I'm fine with it.
That said, currently transformers (in particular the KMS ones) are created very early, before even the apiserver config is created:

envelopeService, err = envelopeServiceFactory(provider.KMS.Endpoint, provider.KMS.Timeout.Duration)

[it's called at the level of initializing options]
So this approach would require significant changes to be implemented.

OTOH, the current Destroy/Stop-like approach is relatively easy - we don't expect people to be doing it on their own. It's the responsibility of the server to do that on shutdown, and as such I don't think that the concern of someone forgetting to call it is a valid one.

@enj
Copy link
Member

enj commented Aug 9, 2022

We already have existing bugs for CRDs not supporting encryption at rest because the wiring is incorrect so I don't buy the argument that "we won't forget because we only need to do it in one place in the API server" - nothing enforces that invariant. If we want to get the go routine number down before test freeze with this change, that is fine. I have to refactor a lot of this code for KMS v2 so I can clean it up then.

@wojtek-t
Copy link
Member

wojtek-t commented Aug 9, 2022

We already have existing bugs for CRDs not supporting encryption at rest because the wiring is incorrect so I don't buy the argument that "we won't forget because we only need to do it in one place in the API server"

My point was that cost is not per transformer-author. We have a fixed cost to properly wire the Initialize/Destroy interface (and yes, I agree with you that it's more than one place, because e.g. we do that differently for CRDs vs built-ins).
But I disagree that the model with destroy is more complex - this is roughly symmetric. In one case (your proposal) we need to enforce proper initialization (also not exactly the same across CRDs and built-ins), in this case, we need to ensure proper deletion. I don't agree that one is visibly simpler than the other.

I don't know what your plans for KMSv2 are exactly and I admit that the additional interface method isn't perfect. But it has one advantage of forcing people to think about the lifecycle, which definitely isn't the case for many authors now (not specific to transformers - it's the same for many controllers, plugins, etc.).

I don't think it's super critical for 1.25, so if you are planning to changing that soon (in 1.26) I'm fine with waiting to see if we can make that work in a simpler way then. If this will take multiple releases to get there, I think we should try merging something like this (I didn't do detailed review of it - just skimmed it).

@enj
Copy link
Member

enj commented Aug 9, 2022

KMS v2 is part of my day job so it will definitely get done 😉

@liggitt had expressed wanting this go routine bit fixed in 1.25 so this may be fine for now (I haven't looked thoroughly).

@fedebongio
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 9, 2022
@aojea
Copy link
Member

aojea commented Aug 9, 2022

KMS v2 is part of my day job so it will definitely get done 😉

nice

@wojtek-t can you take a look and judge if we should add it to 1.25, we are still on the testing freeze period https://www.kubernetes.dev/resources/release/#summary

@wojtek-t
Copy link
Member

@wojtek-t can you take a look and judge if we should add it to 1.25, we are still on the testing freeze period

I took a deeper look into this PR and this LGTM.
I think that even if longer-term we decide to change the interface (as part of changes for KMSv2) and get rid of the Stop() method this is fine.
But given I don't know the exact plans for KMSv2 I would like to hear an ACK from @enj before approving.

From purely test-freeze perpective, that looks reasonable to merge.

@enj
Copy link
Member

enj commented Aug 10, 2022

I have to do a lot of refactors to the KMS code to support the v2 feature set so this will not be a big issue for me to update. @wojtek-t you will have to approve the API server code changes.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Aug 10, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2022
@k8s-ci-robot k8s-ci-robot removed this from the v1.25 milestone Aug 10, 2022
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: enj, sanwishe
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t by writing /assign @wojtek-t in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@enj
Copy link
Member

enj commented Aug 10, 2022

Test freeze for 1.25 was 18:00 PDT Tuesday 9th August 2022 so I think we have missed the window for 1.25...

@enj
Copy link
Member

enj commented Aug 10, 2022

/lgtm cancel

(since we missed the deadline)

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 10, 2022
@enj
Copy link
Member

enj commented Aug 24, 2022

I have opened #111986 to fix this bug without expanding the transformer interface.

@enj
Copy link
Member

enj commented Sep 8, 2022

@sanwishe thanks for your contribution.

This issue was resolved in #111986

/close

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 8, 2022
@k8s-ci-robot
Copy link
Contributor

@sanwishe: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Copy link
Contributor

@enj: Closed this PR.

In response to this:

@sanwishe thanks for your contribution.

This issue was resolved in #111986

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/apiserver area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

reduce goroutine leakage in test/integration/controlplane/transformation
8 participants