OCPBUGS-2873: fix certificate reloads after rotation #145
Force-pushed from 7c8c6f3 to cdc1eec.
Force-pushed from cdc1eec to d956fbc.
@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/jira refresh
@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: juzhao. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Force-pushed from 05b3135 to 064c14b.
When the TLS certificate (used by Prometheus to authenticate to the scraped targets) gets rotated, Prometheus doesn't pick up the new certificate until the connection to the target is re-established. Because Prometheus uses keep-alive HTTP connections, the scrapes start failing after about 1 day and the TargetDown alert fires.

There's an upstream pull request [1] to address the issue but it isn't merged yet. This commit pulls the changes from [1] into our downstream fork by adding a replace directive to go.mod for github.com/prometheus/common. The replacement code lives under patches/github.com/prometheus/common, which is the same version as upstream (v0.37.0) with the upstream PR applied on top of it.

As soon as upstream Prometheus depends on a version of github.com/prometheus/common that fixes the issue, the replace directive in go.mod and the code under the patches/ directory can be removed.

[1] prometheus/common#345

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
Force-pushed from 064c14b to c0d8fb4.
/hold waiting for @juzhao to verify. Since the initial certificate expires after roughly 24h, it's not possible for me to test it (either with an end-to-end test or manually).
Is the fix to reloading certificates lying in the file
Yes, this is a copy/paste from prometheus/common#345.
@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: juzhao. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
Can't we force a certificate rotation ahead of schedule?
/retest
/cherry-pick release-4.12
@simonpasquier: once the present PR merges, I will cherry-pick it on top of release-4.12 in a new PR and assign it to you.
/retest
/retest
/retest
/retest-required
/retest-required
/retest-required
Since we cannot verify the bug with a cluster-bot cluster (the cluster is automatically destroyed within ~3 hours), I will verify the bug after the code is merged into the payload: I will let the cluster run for more than 1 day and then check whether the targets go down.
Thanks @juzhao. For what it's worth, there's a unit test for the change, so I'm reasonably optimistic that it will fix the issue.
/test e2e-agnostic-cmo
/test e2e-agnostic-cmo
/test e2e-agnostic-cmo
/test e2e-agnostic-cmo
/test e2e-agnostic-cmo
/test e2e-agnostic-cmo
/hold waiting for openshift/cluster-monitoring-operator#1817 to merge as
/test e2e-agnostic-cmo
/hold cancel
@simonpasquier: all tests passed!
@simonpasquier: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-2873 has been moved to the MODIFIED state.
@simonpasquier: new pull request created: #149