
apiserver tracing: port forwarding broke #113791

Closed

pohly opened this issue Nov 9, 2022 · 18 comments · Fixed by #113803
Assignees: dashpole
Labels: kind/bug · kind/regression · release-blocker · sig/instrumentation · triage/accepted
Milestone: v1.26

@pohly (Contributor) commented Nov 9, 2022

What happened?

This is a regression caused by #113693. It worked with master a14601a (last merge commit before that PR) and is broken in b2c72fe.

Disabling the APIServerTracing feature gate avoids the problem.

The problem is that port forwarding in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/storage/drivers/proxy/portproxy.go fails when used against a kind cluster built from Kubernetes master. #111023 depends on that for testing in pull-kubernetes-kind-dra.

The symptom is:

E1109 14:48:20.616608   67398 portproxy.go:134] port forwarding for dra-4197/dra-test-driver-57hbd:9001: no connection: error creating error stream: Timeout occurred

What did you expect to happen?

Should work as before...

How can we reproduce it (as minimally and precisely as possible)?

  • merge dynamic resource allocation #111023 with master
  • test/e2e/dra/kind-build-image.sh csiprow/node:latest
  • kind create cluster --config test/e2e/dra/kind.yaml --image csiprow/node:latest
  • KUBECONFIG=$HOME/.kube/config _output/bin/ginkgo -v -focus="DynamicResourceAllocation.*multiple drivers work" ./test/e2e

Anything else we need to know?

CSI mock tests should use the same port forwarding, but don't seem to be affected.


@pohly pohly added the kind/bug Categorizes issue or PR as related to a bug. label Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2022
@liggitt (Member) commented Nov 9, 2022

/kind regression
/milestone v1.26

@k8s-ci-robot k8s-ci-robot added the kind/regression Categorizes issue or PR as related to a regression from a prior release. label Nov 9, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 9, 2022
@liggitt (Member) commented Nov 9, 2022

/assign @dashpole

@liggitt (Member) commented Nov 9, 2022

since the error is context timeout related, it's also possible #113591 is related, or intersects with the tracing enablement (though the context change merged earlier, not in the noted range)

@dashpole (Contributor) commented Nov 9, 2022

I had that thought as well. But the fact that disabling the APIServerTracing feature gate fixes the problem also points in the direction of it being related to enabling APIServerTracing.

@dashpole (Contributor) commented Nov 9, 2022

/sig instrumentation
/triage accepted

@k8s-ci-robot k8s-ci-robot added sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 9, 2022
@dashpole (Contributor) commented Nov 9, 2022

I was able to reproduce it. Narrowed it down to the trace filter in staging/src/k8s.io/apiserver/pkg/endpoints/filters/traces.go
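
For context, that filter roughly wraps the entire apiserver handler chain in otelhttp's handler, so once the feature gate is on every request, including exec and port-forward upgrades, flows through otelhttp. The following is a sketch of that wiring, an approximation rather than the exact Kubernetes source:

// Sketch only: approximately what the tracing filter installs.
package filters

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel/trace"
)

// WithTracing wraps the delegate handler chain in otelhttp so every
// incoming request is traced by otelhttp before reaching the apiserver.
func WithTracing(handler http.Handler, tp trace.TracerProvider) http.Handler {
	opts := []otelhttp.Option{
		otelhttp.WithPublicEndpoint(),
		otelhttp.WithTracerProvider(tp),
	}
	return otelhttp.NewHandler(handler, "KubernetesAPI", opts...)
}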

pohly added a commit to pohly/kubernetes that referenced this issue Nov 9, 2022:
The PR which enabled APIServerTracing by default broke port forwarding between e2e.test and the kind cluster (kubernetes#113791). Disabling the feature works around that problem.
@aojea (Member) commented Nov 9, 2022

@dashpole can you check if this failure reported by @alculquicondor https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/113794/pull-kubernetes-integration/1590410434526056448 can be related?

I1109 19:05:16.913713 105336 trace.go:219] Trace[644215460]: "Get" accept:application/vnd.kubernetes.protobuf, /,audit-id:3af83147-a35d-4b47-b577-098f6b60ec78,client:127.0.0.1,protocol:HTTP/2.0,resource:clusterrolebindings,scope:resource,url:/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/system:monitoring,user-agent:podgc.test/v0.0.0 (linux/amd64) kubernetes/$Format,verb:GET (09-Nov-2022 19:05:15.890) (total time: 1022ms):
Trace[644215460]: [1.022995607s] [1.022995607s] END
test_server.go:221: timed out waiting for the condition

@dashpole (Contributor) commented Nov 9, 2022

That doesn't look related. From my testing thus far, it impacts requests with exec and portforward subresources.

@dashpole (Contributor) commented Nov 9, 2022

There is something wrong with how the request body is wrapped in otelhttp. This diff fixes it:

--- a/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/handler.go
+++ b/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/handler.go
@@ -20,7 +20,6 @@ import (
        "time"

        "github.com/felixge/httpsnoop"
-
        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/metric"
@@ -156,22 +155,22 @@ func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(ctx, h.spanNameFormatter(h.operation, r), opts...)
        defer span.End()

-       readRecordFunc := func(int64) {}
-       if h.readEvent {
-               readRecordFunc = func(n int64) {
-                       span.AddEvent("read", trace.WithAttributes(ReadBytesKey.Int64(n)))
-               }
-       }
+       // readRecordFunc := func(int64) {}
+       // if h.readEvent {
+       //      readRecordFunc = func(n int64) {
+       //              span.AddEvent("read", trace.WithAttributes(ReadBytesKey.Int64(n)))
+       //      }
+       // }

        var bw bodyWrapper
-       // if request body is nil we don't want to mutate the body as it will affect
-       // the identity of it in an unforeseeable way because we assert ReadCloser
-       // fulfills a certain interface and it is indeed nil.
-       if r.Body != nil {
-               bw.ReadCloser = r.Body
-               bw.record = readRecordFunc
-               r.Body = &bw
-       }
+       // // if request body is nil we don't want to mutate the body as it will affect
+       // // the identity of it in an unforeseeable way because we assert ReadCloser
+       // // fulfills a certain interface and it is indeed nil.
+       // if r.Body != nil {
+       //      bw.ReadCloser = r.Body
+       //      bw.record = readRecordFunc
+       //      r.Body = &bw
+       // }

        writeRecordFunc := func(int64) {}
        if h.writeEvent {

bodyWrapper is defined here:

https://github.com/kubernetes/kubernetes/blob/master/vendor/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp/wrap.go#L25-L48
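
As a minimal, self-contained illustration (hypothetical types, not Kubernetes or otelhttp code, and not necessarily the exact failure path here) of how that kind of wrapping can bite: a wrapper that only embeds io.ReadCloser hides any other interfaces the original body implemented, so callers that type-assert for them, as the comment in the diff hints, stop seeing them.

package main

import (
	"fmt"
	"io"
	"strings"
)

// readWriteBody is a stand-in for a request body that also supports writing,
// as some full-duplex/streaming paths may expect (assumption for illustration).
type readWriteBody struct {
	io.Reader
}

func (b *readWriteBody) Write(p []byte) (int, error) { return len(p), nil }
func (b *readWriteBody) Close() error                { return nil }

// wrappedBody mirrors the shape of otelhttp's bodyWrapper: it embeds only
// io.ReadCloser, so extra methods of the original body are no longer visible.
type wrappedBody struct {
	io.ReadCloser
}

func main() {
	var body io.ReadCloser = &readWriteBody{Reader: strings.NewReader("data")}

	_, isWriter := body.(io.Writer)
	fmt.Println("original body also implements io.Writer:", isWriter) // true

	var wrapped io.ReadCloser = &wrappedBody{ReadCloser: body}
	_, isWriter = wrapped.(io.Writer)
	fmt.Println("wrapped body also implements io.Writer:", isWriter) // false
}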

@dashpole (Contributor) commented Nov 9, 2022

I'll keep digging

@liggitt (Member) commented Nov 9, 2022

> There is something wrong with how the request body is wrapped in otelhttp. This diff fixes it:

what I find more concerning is that the promotion to beta apparently put tracing in the request path even when it wasn't configured... I thought modifying the request/filter path was only supposed to happen when the apiserver had --tracing-config-file set

@dashpole (Contributor)

Looks like a few more tests are broken. I think a revert is the best course for now.

@dashpole (Contributor)

#113803

@liggitt (Member) commented Nov 10, 2022

Thanks, once the issue is resolved in otel, it would be good to figure out what presubmit signal we were missing that could have caught the issue… sounded like some of the folks hitting this did so quite reliably

@dashpole (Contributor)

I've opened the upstream fix for this, and verified that it fixes this reproduction case: open-telemetry/opentelemetry-go-contrib#2983

@dashpole (Contributor)

But I think it is still safer to leave it out of the release.

@liggitt (Member) commented Nov 10, 2022

> Thanks, once the issue is resolved in otel, it would be good to figure out what presubmit signal we were missing that could have caught the issue… sounded like some of the folks hitting this did so quite reliably

weirdly, there was a post-submit periodic job that turned super red because of this issue - https://testgrid.k8s.io/sig-node-containerd#containerd-e2e-ubuntu&width=20

it would be great to have had that signal before merge, not sure what is different in that job 😕

@dashpole (Contributor)

Yep. For future reference (for myself), an easy way to check would be to run the [sig-instrumentation] MetricsGrabber tests.
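
For example, following the same invocation pattern as the reproduction steps above (the focus expression is an assumption, not a verified command):

KUBECONFIG=$HOME/.kube/config _output/bin/ginkgo -v -focus="MetricsGrabber" ./test/e2e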
