nil pointer dereference in main-2022-10-02-d533abf8 forward #6021

Closed
RohitKochhar opened this issue Jan 5, 2023 · 35 comments

@RohitKochhar
Contributor

RohitKochhar commented Jan 5, 2023

Thanos, Prometheus and Golang version used: main-2022-10-02-d533abf8

Object Storage Provider: AWS

What happened:

Query pods fail with the following logs:

level=info ts=2023-01-09T19:05:58.690951095Z caller=factory.go:43 msg="loading tracing configuration"
level=info ts=2023-01-09T19:05:58.696610373Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2023-01-09T19:05:58.697727856Z caller=query.go:759 msg="starting query node"
level=info ts=2023-01-09T19:05:58.697991079Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2023-01-09T19:05:58.69803157Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2023-01-09T19:05:58.698204012Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2023-01-09T19:05:58.698275323Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2023-01-09T19:05:58.698316933Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=error ts=2023-01-09T19:05:58.700889734Z caller=resolver.go:99 msg="failed to lookup SRV records" host=_grpc._tcp.thanos-ruler-operated err="no such host"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13e0ee8]

goroutine 296 [running]:
go.opentelemetry.io/otel/sdk/trace.parentBased.ShouldSample({{0x0, 0x0}, {{0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}, {0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}}}, ...)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/sampling.go:280 +0x1c8
github.com/thanos-io/thanos/pkg/tracing/migration.samplerWithOverride.ShouldSample(...)
	/app/pkg/tracing/migration/sampler.go:42
go.opentelemetry.io/otel/sdk/trace.(*tracer).newSpan(0xc0006e1bc0, {0x267d0d0, 0xc0007ec120}, {0xc000778630, 0x16}, 0xc0007eb470)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:90 +0x456
go.opentelemetry.io/otel/sdk/trace.(*tracer).Start(0xc0006e1bc0, {0x267d0d0, 0xc0007ec120}, {0xc000778630, 0x16}, {0xc0007bdc00?, 0x4?, 0xc0007eb5a0?})
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:47 +0x10e
go.opentelemetry.io/otel/bridge/opentracing.(*WrapperTracer).Start(0xc000782228, {0x267d0d0?, 0xc0007ec120?}, {0xc000778630?, 0x1e437a0?}, {0xc0007bdc00?, 0x7f52aecb03d8?, 0xc000a209c0?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/wrapper.go:79 +0x4b
go.opentelemetry.io/otel/bridge/opentracing.(*BridgeTracer).StartSpan(0xc000770960, {0xc000778630, 0x16}, {0xc0007ec0c0, 0x3, 0x39346d0?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/bridge.go:430 +0x3f4
github.com/thanos-io/thanos/pkg/tracing/migration.(*bridgeTracerWrapper).StartSpan(0x1f9a9a0?, {0xc000778630?, 0x39346d0?}, {0xc0007ec0c0?, 0x38c9480?, 0xc0007eb818?})
	/app/pkg/tracing/migration/bridge.go:89 +0x26
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.newClientSpanFromContext({0x267d098, 0xc0007b5b60}, {0x2678120, 0xc000125748}, {0xc000778630, 0x16})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/tracing/client.go:92 +0x244
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.(*opentracingClientReportable).ClientReporter(0xc0007824e0, {0x267d098, 0xc0007b5b60}, {0x0?, 0x0?}, {0x2197e1a, 0x5}, {0x21bcafa, 0x10}, {0x21bcb0b, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/tracing/client.go:51 +0x127
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.UnaryClientInterceptor.func1({0x267d098, 0xc0007b5b60}, {0x21bcaf9, 0x16}, {0x209f7e0, 0x39346d0}, {0x209f920, 0xc0007b5c80}, 0xc000100960?, 0x227e1b8, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/client.go:19 +0x195
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1({0x267d098?, 0xc0007b5b60?}, {0x21bcaf9?, 0x38?}, {0x209f7e0?, 0x39346d0?}, {0x209f920?, 0xc0007b5c80?}, 0x0?, {0xc000a13e80, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:74 +0x86
github.com/grpc-ecosystem/go-grpc-prometheus.(*ClientMetrics).UnaryClientInterceptor.func1({0x267d098, 0xc0007b5b60}, {0x21bcaf9, 0x16}, {0x209f7e0, 0x39346d0}, {0x209f920, 0xc0007b5c80}, 0x8?, 0xc00000d020, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/client_metrics.go:112 +0x117
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1({0x267d098?, 0xc0007b5b60?}, {0x21bcaf9?, 0x203000?}, {0x209f7e0?, 0x39346d0?}, {0x209f920?, 0xc0007b5c80?}, 0x1?, {0xc000a13e80, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:74 +0x86
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1({0x267d098, 0xc0007b5b60}, {0x21bcaf9, 0x16}, {0x209f7e0, 0x39346d0}, {0x209f920, 0xc0007b5c80}, 0x0?, 0x227e1b8, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:83 +0x157
google.golang.org/grpc.(*ClientConn).Invoke(0xc0005c3900?, {0x267d098?, 0xc0007b5b60?}, {0x21bcaf9?, 0x16?}, {0x209f7e0?, 0x39346d0?}, {0x209f920?, 0xc0007b5c80?}, {0xc00071b680, ...})
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/call.go:35 +0x223
github.com/thanos-io/thanos/pkg/info/infopb.(*infoClient).Info(0xc000a04610, {0x267d098, 0xc0007b5b60}, 0x8?, {0xc00071b680, 0x1, 0x1})
	/app/pkg/info/infopb/rpc.pb.go:422 +0xc9
github.com/thanos-io/thanos/pkg/query.(*endpointRef).Metadata(0xc00016fa80, {0x267d098, 0xc0007b5b60}, {0x2663ca0, 0xc000a04610}, {0x267e368, 0xc000a04618})
	/app/pkg/query/endpointset.go:66 +0x11e
github.com/thanos-io/thanos/pkg/query.(*EndpointSet).updateEndpoint(0xc0000b4980, {0x267d098, 0xc0007b5b60}, 0xc00000ced0, 0xc00016fa80)
	/app/pkg/query/endpointset.go:414 +0x105
github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.func2(0xc00000ced0)
	/app/pkg/query/endpointset.go:354 +0x2cb
created by github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update
	/app/pkg/query/endpointset.go:343 +0x60a

Ruler pods fail with the following logs:

level=info ts=2023-01-09T19:06:55.17554435Z caller=intrumentation.go:56 component=rules msg="changing probe status" status=ready
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13e0ee8]

goroutine 1725 [running]:
go.opentelemetry.io/otel/sdk/trace.parentBased.ShouldSample({{0x0, 0x0}, {{0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}, {0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}}}, ...)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/sampling.go:280 +0x1c8
github.com/thanos-io/thanos/pkg/tracing/migration.samplerWithOverride.ShouldSample(...)
	/app/pkg/tracing/migration/sampler.go:42
go.opentelemetry.io/otel/sdk/trace.(*tracer).newSpan(0xc00084ebc0, {0x267d0d0, 0xc001b3e2d0}, {0x21a41c9, 0xc}, 0xc04dae9ab8)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:90 +0x456
go.opentelemetry.io/otel/sdk/trace.(*tracer).Start(0xc00084ebc0, {0x267d0d0, 0xc001b3e2d0}, {0x21a41c9, 0xc}, {0xc020e56a40?, 0x4?, 0xc04c3b4be8?})
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:47 +0x10e
go.opentelemetry.io/otel/bridge/opentracing.(*WrapperTracer).Start(0xc000516870, {0x267d0d0?, 0xc001b3e2d0?}, {0x21a41c9?, 0x1e437a0?}, {0xc020e56a40?, 0x0?, 0xc045d54c00?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/wrapper.go:79 +0x4b
go.opentelemetry.io/otel/bridge/opentracing.(*BridgeTracer).StartSpan(0xc00039ede0, {0x21a41c9, 0xc}, {0x0, 0x0, 0x33450?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/bridge.go:430 +0x3f4
github.com/thanos-io/thanos/pkg/tracing/migration.(*bridgeTracerWrapper).StartSpan(0x267d0d0?, {0x21a41c9?, 0x10000000043b247?}, {0x0?, 0xc045c1c780?, 0xc045d54e90?})
	/app/pkg/tracing/migration/bridge.go:89 +0x26
github.com/thanos-io/thanos/pkg/tracing.StartSpan({0x267d0d0, 0xc03ace95c0}, {0x21a41c9, 0xc}, {0x0, 0x0, 0x0})
	/app/pkg/tracing/tracing.go:72 +0x18f
github.com/thanos-io/thanos/pkg/tracing.DoInSpan({0x267d0d0?, 0xc03ace95c0?}, {0x21a41c9?, 0x7a?}, 0xc04dae9f60, {0x0?, 0xc0469fa370?, 0xc0461c2840?})
	/app/pkg/tracing/tracing.go:93 +0x59
main.runRule.func7()
	/app/cmd/thanos/rule.go:532 +0x96
github.com/oklog/run.(*Group).Run.func1({0xc03ace95f0?, 0xc000293a40?})
	/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 +0x2f
created by github.com/oklog/run.(*Group).Run
	/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:37 +0x22a

What you expected to happen:

Query and Ruler pods to behave as expected

How to reproduce it (as minimally and precisely as possible):

Deploy a query pod with the following STS configuration:

containers:
      - name: thanos-query
        args:
        - query
        - --grpc-address=[REDACTED]
        - --http-address=[REDACTED]
        - --log.level=info
        - --log.format=logfmt
        - --query.replica-label=prometheus_replica
        - --query.replica-label=rule_replica
        - --query.replica-label=thanos_ruler_replica
        - --endpoint=dnssrv+_grpc._tcp.thanos-ruler-operated
        - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar
        - --endpoint=dnssrv+_grpc._tcp.thanos-store
        - --query.auto-downsampling
        - --tracing.config=$(TRACING_CONFIG)

Deploy a ruler pod using the monitoring.coreos.com/v1 CRD

Full logs to relevant components:

Anything else we need to know:

@douglascamata
Contributor

douglascamata commented Jan 5, 2023

@RohitKochhar was kind enough to try the image tags built from the commits merged into main (see conversation at https://cloud-native.slack.com/archives/CL25937SP/p1672937478880859) and the results are:

  • main-2022-10-31-7968915 has the bug.
  • main-2022-10-30-0e0f4fb doesn't have the bug.

The only difference between these two images is the presence of the commit 7968915 (from #5716) in the image that has the bug.

Can anyone else confirm the DNS resolution is working fine on v0.30?

This might require a release 0.30.2 with a hotfix or rollback of that commit.

@GiedriusS
Member

Interesting 🤔 according to the stack trace, the crash happens here, i.e. pb.root is nil. I wonder how this DNS stuff relates to the stack trace 😄
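
For illustration, here is a minimal Go sketch of what the stack trace points at: a parentBased sampler whose root sampler is nil panics when it has to make a root-span decision. This is a hypothetical repro, not the actual Thanos tracing setup; the span name and the way the nil root arises are assumptions:

package main

import (
	"context"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Assumption: some configuration path leaves the root sampler nil.
	var root sdktrace.Sampler
	sampler := sdktrace.ParentBased(root)

	// A span with no valid parent span context takes the root branch of
	// parentBased.ShouldSample, which calls ShouldSample on the nil root and
	// fails with the same nil pointer dereference seen in the logs above.
	_ = sampler.ShouldSample(sdktrace.SamplingParameters{
		ParentContext: context.Background(),
		Name:          "example-span",
	})
}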

@douglascamata
Contributor

@RohitKochhar what is resolving DNS in your cluster: CoreDNS or KubeDNS? Which k8s version are you running?

@douglascamata
Contributor

Interesting 🤔 according to the stack trace, the crash happens here, i.e. pb.root is nil. I wonder how this DNS stuff relates to the stack trace 😄

@GiedriusS are you sure that this is the correct reading of the stack trace? I think tracing code will always be at the top of the stack because it wraps everything. But further down we can see github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update, which should be using the data produced by the code modified in 7968915. I think the nil pointer dereference comes from the new code somehow returning a nil, and it blows up exactly at the 2nd innermost line in the stack (the 1st innermost is an anonymous go-function call):

e.updateEndpoint(ctx, spec, newRef)

@GiedriusS
Member

Oops, yes, my bad 😞 you are correct 👍

@RohitKochhar
Contributor Author

Pinging for an update on this @douglascamata @fpetkovski

@douglascamata
Contributor

@RohitKochhar we need some extra information to try to reproduce the issue, investigate and work on a fix. See my previous comment: #6021 (comment).

@RohitKochhar
Contributor Author

Oh, sorry I didn't see that. We use CoreDNS @douglascamata

@GiedriusS
Member

@RohitKochhar do you have tracing enabled? Could we see the config?

@douglascamata
Contributor

I am not able to reproduce this in a prod-like cluster also using CoreDNS. We need your help @RohitKochhar to investigate this cluster and look for anything out of the ordinary.

@RohitKochhar
Contributor Author

Ok, I can put together a better summary and share it soon

@RohitKochhar
Contributor Author

@douglascamata after looking into it further, I am unsure why I didn't catch this originally, but the breaking version is v0.29.0. I am able to upgrade to v0.28.0 without problems

@douglascamata
Contributor

There's some conflicting information then. I asked you to try two different image tags straight from the main branch container builds: main-2022-10-31-7968915 (has the bug) and main-2022-10-30-0e0f4fb (doesn't have the bug). Both of these images contain all the commits from v0.29.0, as they are part of the v0.30 release.

@RohitKochhar
Contributor Author

Yeah, I think I incorrectly synced the resources through ArgoCD. I am currently retrying main-2022-10-30-0e0f4fb

@RohitKochhar
Contributor Author

RohitKochhar commented Jan 9, 2023

I can now confirm that main-2022-10-30-0e0f4fb does not work for our query and ruler pods

@douglascamata
Contributor

Okay. So that rules out what we thought caused your issue. 😄

If you have some time, please try to bisect the snapshot images to find exactly where it starts breaking. But it seems like your pods do not like something in the miekgdns resolver. It has been the default in Query since v0.23 and in Rule since v0.30.

@douglascamata
Contributor

Between v0.28 and v0.29 though we had no DNS changes.

@RohitKochhar
Contributor Author

Ok, I'll start working my way through the snapshot images between v0.28 and v0.29 to find the last version I can use before it breaks

@RohitKochhar
Contributor Author

@douglascamata I added more information to the initial issue comment, including the logs for both the ruler and query pods that are failing and the configuration used for both.

The latest snapshot I could use was main-2022-09-30-21aa1db0; anything from main-2022-10-02-d533abf8 onward causes the logs above. Can you point me to the commits that were added between these images?

@RohitKochhar RohitKochhar changed the title nil pointer dereference in v0.30.0 nil pointer dereference in main-2022-10-02-d533abf8 forward Jan 9, 2023
@douglascamata
Contributor

The latest snapshot I could use was main-2022-09-30-21aa1db0; anything from main-2022-10-02-d533abf8 onward causes the logs above. Can you point me to the commits that were added between these images?

That's pretty easy. The image tag format is <BRANCH>-<DATE>-<COMMIT_HASH>.

From 21aa1db to d533abf we have only 64e7e3b.

Are you sure that main-2022-10-02-64e7e3bb worked? It came right after main-2022-09-30-21aa1db0 and migrates tracing from OpenTracing to OTLP. This makes me think @GiedriusS's initial hunch was correct.

@RohitKochhar
Contributor Author

main-2022-10-02-64e7e3bb does not work, so I think you are right that @GiedriusS's initial hunch was correct. Any ideas for solutions?

@RohitKochhar
Contributor Author

Here is our tracing configuration:

type: jaeger
config:
  service_name: thanos-query
  endpoint: $(JAEGER_ENDPOINT)
  sampler_manager_host_port: $(JAEGER_SAMPLING_ENDPOINT)

With the Jaeger endpoints pointing to our grafana-agent-trace URLs

@matej-g
Collaborator

matej-g commented Jan 10, 2023

Hmm, this looks like #5872, which is fixed in #5887, so it should be fine, but it seems as if that fix was never merged back to main from 0.29. Can you confirm, @yeya24?

@yeya24
Contributor

yeya24 commented Jan 10, 2023

Yeah I think that's the same issue. We just never released v0.29.1 and never merged those commits back to main.

@douglascamata
Contributor

douglascamata commented Jan 10, 2023

Great find, @matej-g! 👏

@yeya24 @matej-g: shall we merge it back then and release 0.29.1 and 0.30.2?

@yeya24
Contributor

yeya24 commented Jan 10, 2023

@douglascamata Yeah I think we just need to do the release and that's it. I think we didn't release v0.29.1 previously because of #5862 IIRC. @GiedriusS

@douglascamata
Contributor

We need to release v0.29.1, then cherry-pick the commit df047ef into main and release v0.30.2 with it.

@xBazilio
Contributor

This won't work: df047ef#diff-eb3987ed9aba572232997e0b7878dbe798ccd79ca31bf8e933963e89d2901723R143

The sampler will still be nil because the "default" case will never be evaluated.

@xBazilio
Contributor

xBazilio commented Jan 20, 2023

A solution would be to remove the "default:" label or to add "fallthrough".
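
For anyone reading along, here is a minimal Go sketch of the switch pitfall being described; it is not the actual Thanos source, and the names and string values are made up for illustration:

package main

import "fmt"

// newSampler mirrors the shape of the buggy switch: the intent is for the
// named case to share the default branch, but Go cases do not fall through,
// so a matched case with an empty body assigns nothing and the sampler
// stays nil.
func newSampler(samplerType string) func() string {
	var sampler func() string
	switch samplerType {
	case "ratelimiting": // matches, empty body, nothing runs
	default:
		sampler = func() string { return "rate-limiting sampler" }
	}
	return sampler
}

func main() {
	s := newSampler("ratelimiting")
	fmt.Println(s == nil) // prints "true"; calling s() later would panic
	// The fixes mentioned above: drop the "default:" label so the named case
	// owns the body, or add "fallthrough" at the end of the named case.
}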

@douglascamata
Contributor

@xBazilio is this a theory, or have you tested it out?

@xBazilio
Contributor

xBazilio commented Jan 20, 2023

@xBazilio is this a theory, or have you tested it out?

Yes, I've tested it. I deployed on dev, caught the panic, searched, and found this issue. Then I cherry-picked the commit, but the panic was still there. Then with a debugger I found this switch/case issue and fixed it by removing the "default:" line. A colleague suggested using the "fallthrough" keyword instead, but right now it isn't necessary, as "SamplerTypeRateLimiting" is the default.

@douglascamata
Contributor

@xBazilio would you like to open a PR with this fix?

@xBazilio
Contributor

I created the PR https://github.com/thanos-io/thanos/pull/6066/files; it's being reviewed.

@douglascamata
Contributor

@xBazilio just missing the changelog entry and then it'll be perfect. Thanks for the contribution!

@RohitKochhar
Contributor Author

This seems to be solved by upgrading to v0.30.2, closing.
