OCPBUGS-2873: fix certificate reloads after rotation #145

simonpasquier · 2022-10-26T10:34:53Z

When the TLS certificate (used by Prometheus to authenticate to the
scraped targets) gets rotated, Prometheus doesn't pick up the new
certificate until the connection to the target is re-established.
Because Prometheus uses keep-alive HTTP connections, the consequence is
that the scrapes start failing after about 1 day and the TargetDown
alert fires.

There's an upstream pull request [1] to address the issue but it isn't
merged yet. This commit pulls the changes from [1] into our downstream
fork by adding a replace directive to go.mod for
github.com/prometheus/common. The replacement code lives under
patches/github.com/prometheus/common and is the same version as
upstream (v0.37.0) with the upstream PR applied on top of it.

As soon as upstream Prometheus depends on a version of
github.com/prometheus/common that fixes the issue, the replace directive
in go.mod and the code under the patches/ directory can be removed.

[1] prometheus/common#345
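
For illustration, a replace directive of the kind described above could look roughly like the line below in go.mod. The relative path is an assumption inferred from the patches/ location named in the description; the exact line in the commit may differ.

```
replace github.com/prometheus/common => ./patches/github.com/prometheus/common
```

With a filesystem replacement like this, the Go toolchain builds the module from the local directory instead of the published v0.37.0 release, which is why removing the directive and the patches/ directory later is enough to go back to upstream.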

openshift-ci-robot · 2022-10-26T14:05:07Z

@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is invalid:

expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

simonpasquier · 2022-10-26T14:05:50Z

/jira refresh

openshift-ci-robot · 2022-10-26T14:05:56Z

@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.12.0) matches configured target version for branch (4.12.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

openshift-ci · 2022-10-26T14:06:05Z

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: juzhao.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>

simonpasquier · 2022-10-26T15:41:07Z

/hold

waiting for @juzhao to verify. Since the initial certificate expires after roughly 24h, it's not possible for me to test it (either with an end-to-end test or manually).

raptorsun · 2022-10-26T16:13:13Z

Does the fix for reloading certificates live in vendor/github.com/prometheus/common/config/http_config.go, where the key file and cert file are added to tlsRoundTripper?

simonpasquier · 2022-10-26T16:19:48Z

> Does the fix for reloading certificates live in vendor/github.com/prometheus/common/config/http_config.go, where the key file and cert file are added to tlsRoundTripper?

Yes, this is a copy/paste from prometheus/common#345.

openshift-ci-robot · 2022-10-26T16:20:08Z

@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.12.0) matches configured target version for branch (4.12.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

openshift-ci · 2022-10-26T16:20:10Z

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: juzhao.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

raptorsun · 2022-10-26T16:41:05Z

> > Is the fix to reloading certificates lying in the file vendor/github.com/prometheus/common/config/http_config.go, adding keyfile and cert file into tlsRoundTripper?

> yes, this is a copy/paste from prometheus/common#345

Can't we force a certificate rotation ahead of schedule?

simonpasquier · 2022-11-15T14:01:28Z

/retest

simonpasquier · 2022-11-15T14:01:39Z

/cherry-pick release-4.12

openshift-cherrypick-robot · 2022-11-15T14:01:42Z

@simonpasquier: once the present PR merges, I will cherry-pick it on top of release-4.12 in a new PR and assign it to you.

simonpasquier · 2022-11-15T16:33:45Z

/retest

jan--f · 2022-11-16T09:17:37Z

/retest

jan--f · 2022-11-16T10:21:12Z

/retest

jan--f · 2022-11-17T07:42:42Z

/retest-required

jan--f · 2022-11-17T14:43:00Z

/retest-required

raptorsun · 2022-11-17T17:09:00Z

/retest-required

juzhao · 2022-11-21T06:47:27Z

Since we cannot verify the bug on a cluster-bot cluster (the cluster is automatically destroyed within ~3 hours), I will verify it after the code has merged into a payload: I will let the cluster run for more than 1 day, then check whether the targets go down.
/label qe-approved

openshift-ci-robot · 2022-11-21T07:07:56Z

/retest-required

Remaining retests: 0 against base HEAD b1b8dbf and 2 for PR HEAD c0d8fb4 in total

simonpasquier · 2022-11-21T08:12:33Z

Thanks @juzhao. For what it's worth, there's a unit test for the change, so I'm reasonably optimistic that it will fix the issue.

simonpasquier · 2022-11-21T11:54:27Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-21T15:24:11Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-22T08:34:18Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-22T10:59:56Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-22T14:01:27Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-22T16:02:24Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-23T09:51:42Z

/hold

Waiting for openshift/cluster-monitoring-operator#1817 to merge, as assert_remote_write_cluster_id_relabel_config_works is failing repeatedly.

simonpasquier · 2022-11-23T12:33:30Z

/test e2e-agnostic-cmo

simonpasquier · 2022-11-23T14:57:47Z

/hold cancel

openshift-ci · 2022-11-23T15:19:47Z

@simonpasquier: all tests passed!

openshift-ci-robot · 2022-11-23T15:24:33Z

@simonpasquier: All pull requests linked via external trackers have merged:

openshift/prometheus#145

Jira Issue OCPBUGS-2873 has been moved to the MODIFIED state.

openshift-cherrypick-robot · 2022-11-23T15:25:30Z

@simonpasquier: new pull request created: #149

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2022

openshift-ci bot requested review from jan--f and raptorsun October 26, 2022 10:35

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2022

simonpasquier force-pushed the fix-expired-certs branch 2 times, most recently from 7c8c6f3 to cdc1eec Compare October 26, 2022 10:50

simonpasquier mentioned this pull request Oct 26, 2022

Update to v0.8.0 openshift/procfs#7

Closed

simonpasquier force-pushed the fix-expired-certs branch from cdc1eec to d956fbc Compare October 26, 2022 14:04

simonpasquier changed the title ~~wip: use patched prometheus/common library for cert reloads~~ OCPBUGS-2873: fix certificate reloads after rotation Oct 26, 2022

openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 26, 2022

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2022

simonpasquier force-pushed the fix-expired-certs branch 2 times, most recently from 05b3135 to 064c14b Compare October 26, 2022 14:35

simonpasquier force-pushed the fix-expired-certs branch from 064c14b to c0d8fb4 Compare October 26, 2022 14:36

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2022

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 21, 2022

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2022

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2022

openshift-merge-robot merged commit 6e59b84 into openshift:master Nov 23, 2022

openshift-cherrypick-robot mentioned this pull request Nov 23, 2022

[release-4.12] OCPBUGS-4048: fix certificate reloads after rotation #149

Merged

simonpasquier deleted the fix-expired-certs branch November 23, 2022 15:54
