OCPBUGS-2873: fix certificate reloads after rotation #145

Merged

@simonpasquier simonpasquier commented Oct 26, 2022

When the TLS certificate (used by Prometheus to authenticate to the
scraped targets) gets rotated, Prometheus doesn't pick up the new
certificate until the connection to the target is re-established.
Because Prometheus uses keep-alive HTTP connections, the consequence is
that the scrapes start failing after about 1 day and the TargetDown
alert fires.

There's an upstream pull request [1] to address the issue but it isn't
merged yet. This commit pulls the changes from [1] into our downstream
fork by adding a replace directive to go.mod for the
github.com/prometheus/common module. The replacement code lives under
patches/github.com/prometheus/common and is the same version as
upstream (v0.37.0) with the upstream PR applied on top of it.

As soon as upstream Prometheus depends on a version of
github.com/prometheus/common that fixes the issue, the replace directive
in go.mod and the code under the patches/ directory can be removed.

[1] prometheus/common#345
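
For illustration, a replace directive of the kind described above could look like the following line in go.mod. The relative-path form is an assumption derived from the patches/ directory named in the description; the exact line in the commit may differ.

```go.mod
// Hypothetical sketch: point the module at the patched copy kept in this repository.
replace github.com/prometheus/common => ./patches/github.com/prometheus/common
```

Removing this line (and the patches/ directory) would make the build resolve the module from its regular upstream version again.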

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2022
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2022
@simonpasquier simonpasquier force-pushed the fix-expired-certs branch 2 times, most recently from 7c8c6f3 to cdc1eec Compare October 26, 2022 10:50
@simonpasquier simonpasquier changed the title from "wip: use patched prometheus/common library for cert reloads" to "OCPBUGS-2873: fix certificate reloads after rotation" Oct 26, 2022
@openshift-ci-robot openshift-ci-robot added the jira/severity-important (referenced Jira bug's severity is important for the branch this PR is targeting) and jira/invalid-bug (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) labels Oct 26, 2022
@openshift-ci-robot

@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is invalid:

  • expected the bug to target the "4.12.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2022
@simonpasquier
Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and bugzilla/valid-bug (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) labels, and removed the jira/invalid-bug (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) label Oct 26, 2022
@openshift-ci-robot

@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

@openshift-ci

openshift-ci bot commented Oct 26, 2022

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: juzhao.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

@simonpasquier simonpasquier force-pushed the fix-expired-certs branch 2 times, most recently from 05b3135 to 064c14b Compare October 26, 2022 14:35
Signed-off-by: Simon Pasquier <spasquie@redhat.com>
@simonpasquier
Author

/hold

waiting for @juzhao to verify. Since the initial certificate expires after roughly 24h, it's not possible for me to test it (either with an end-to-end test or manually).

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 26, 2022
@raptorsun

Does the fix for reloading certificates live in vendor/github.com/prometheus/common/config/http_config.go, where the key file and cert file are added to tlsRoundTripper?

@simonpasquier
Author

yes, this is a copy/paste from prometheus/common#345

@openshift-ci-robot

@simonpasquier: This pull request references Jira Issue OCPBUGS-2873, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

@openshift-ci

openshift-ci bot commented Oct 26, 2022

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: juzhao.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

@raptorsun

raptorsun commented Oct 26, 2022

Can't we force a certificate rotation ahead of schedule?

@simonpasquier
Author

/retest

@simonpasquier
Author

/cherry-pick release-4.12

@openshift-cherrypick-robot

@simonpasquier: once the present PR merges, I will cherry-pick it on top of release-4.12 in a new PR and assign it to you.

@simonpasquier
Author

/retest

2 similar comments
@jan--f

jan--f commented Nov 16, 2022

/retest

@jan--f

jan--f commented Nov 16, 2022

/retest

@jan--f

jan--f commented Nov 17, 2022

/retest-required

2 similar comments
@jan--f

jan--f commented Nov 17, 2022

/retest-required

@raptorsun

/retest-required

@juzhao

juzhao commented Nov 21, 2022

Since we cannot verify the bug with a cluster-bot cluster (the cluster is automatically destroyed within ~3 hours), I will verify the bug after the code is merged into the payload: I will let the cluster run for more than 1 day and then check whether the targets go down.
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 21, 2022
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD b1b8dbf and 2 for PR HEAD c0d8fb4 in total

@simonpasquier
Author

Thanks @juzhao. For what it's worth, there's a unit test for the change, so I'm reasonably optimistic that it will fix the issue.

@simonpasquier
Author

/test e2e-agnostic-cmo

5 similar comments
@simonpasquier
Author

/hold

waiting for openshift/cluster-monitoring-operator#1817 to merge as assert_remote_write_cluster_id_relabel_config_works is failing repeatedly.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2022
@simonpasquier
Author

/test e2e-agnostic-cmo

@simonpasquier
Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 23, 2022
@openshift-ci

openshift-ci bot commented Nov 23, 2022

@simonpasquier: all tests passed!

@openshift-merge-robot openshift-merge-robot merged commit 6e59b84 into openshift:master Nov 23, 2022
@openshift-ci-robot

@simonpasquier: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-2873 has been moved to the MODIFIED state.

@openshift-cherrypick-robot

@simonpasquier: new pull request created: #149

@simonpasquier simonpasquier deleted the fix-expired-certs branch November 23, 2022 15:54
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
7 participants