
Istio mTLS fails after Prometheus 2.20.1 #9068

Closed · tillig opened this issue Jul 8, 2021 · 37 comments · Fixed by #9398

@tillig commented Jul 8, 2021

What did you do?

Using Istio 1.6.14 I am mounting the Istio sidecar manually without proxying any traffic so I can access the Istio mTLS certificates. I have a scrape configuration set up to use those certificates to scrape endpoints that have Istio sidecars.

Under Prometheus v2.20.1 this works perfectly. Under Prometheus v2.21.0 and above it fails with "connection reset by peer."

You can follow along with my troubleshooting attempt in the newsgroup thread, but I've reached a point where I can't figure it out, and I think there's a bug in here somewhere.

What did you expect to see?

I expected v2.28.0 to continue scraping Istio pods just like v2.20.1 did, using the same scrape configuration and the same certificates.

What did you see instead? Under which circumstances?

In versions 2.21.0 through 2.28.0 any endpoint using Istio mTLS fails to be scraped with the message "connection reset by peer." Here's the debug log message under v2.28.0:

level=debug ts=2021-07-06T20:58:32.984Z caller=scrape.go:1236 component="scrape manager" scrape_pool=kubernetes-pods-istio-secure target=https://10.244.3.10:9102/metrics msg="Scrape failed" err="Get \"https://10.244.3.10:9102/metrics\": read tcp 10.244.4.89:36666->10.244.3.10:9102: read: connection reset by peer"

Environment

  • System information: Linux 5.4.0-1047-azure x86_64
  • Prometheus version:
prometheus, version 2.28.0 (branch: HEAD, revision: ff58416a0b0224bab1f38f949f7d7c2a0f658940)
  build user:       root@32b9079a2740
  build date:       20210621-15:45:36
  go version:       go1.16.5
  platform:         linux/amd64
  • Prometheus configuration file:

The relevant scrape job is here. The certificates are mounted at /etc/istio-certs. I have validated that the certificate files are there and properly mounted.

    - job_name: kubernetes-pods-istio-secure
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: keep
        regex: (([^;]+);([^;]*))|(([^;]*);(true))
        source_labels:
        - __meta_kubernetes_pod_annotation_sidecar_istio_io_status
        - __meta_kubernetes_pod_annotation_istio_mtls
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: keep
        regex: ([^:]+):(\d+)
        source_labels:
        - __address__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: pod_name
      scheme: https
      tls_config:
        ca_file: /etc/istio-certs/root-cert.pem
        cert_file: /etc/istio-certs/cert-chain.pem
        insecure_skip_verify: true
        key_file: /etc/istio-certs/key.pem
  • Logs: see the debug log message above.

Additional context / things I've tried:

I noticed in v2.21.0 that several things changed, and I'm not sure if any of them affect this issue.

  • The Go version was updated to 1.15
  • There were some challenges around HTTP/2 which caused it to be disabled

I have tried setting GODEBUG=x509ignoreCN=0 on the pod to see if it's the Go certificate common name handling that was causing the issue. It didn't help.
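
Go 1.15 stopped falling back to the certificate CommonName during hostname verification, so one sanity check is whether the mounted Istio leaf cert carries SANs at all. A minimal Go sketch to inspect it (assuming the same /etc/istio-certs mount as above):

package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	// Read the Istio-mounted certificate chain; the first PEM block is the leaf.
	data, err := os.ReadFile("/etc/istio-certs/cert-chain.pem")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		panic("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	// Istio workload certs typically have an empty subject and carry their
	// identity as a SPIFFE URI SAN, so CommonName handling shouldn't matter.
	fmt.Println("Subject: ", cert.Subject)
	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("URI SANs:", cert.URIs)
}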

I've verified that v2.20.1 is definitely working and none of the versions above that work. I've tried them all.

I've created a different container with both curl and openssl in it and mounted the certificates there, just to make sure it wasn't a weird mounting problem. Both curl and openssl work.

curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure

openssl s_client -connect 10.244.3.10:9102 -cert /etc/istio-certs/cert-chain.pem  -key /etc/istio-certs/key.pem -CAfile /etc/istio-certs/root-cert.pem -alpn "istio"

I noticed openssl doesn't connect unless you set that -alpn flag. I saw #6910 and thought this might be related, but I'm unsure. The fix for that one says it'll be out in 2.19.0, but that hasn't been released yet.
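
For what it's worth, here's a minimal Go sketch of the same ALPN check, using the target address and cert paths from this issue (InsecureSkipVerify mirrors the scrape config above):

package main

import (
	"crypto/tls"
	"fmt"
)

func main() {
	cert, err := tls.LoadX509KeyPair(
		"/etc/istio-certs/cert-chain.pem",
		"/etc/istio-certs/key.pem",
	)
	if err != nil {
		panic(err)
	}
	conn, err := tls.Dial("tcp", "10.244.3.10:9102", &tls.Config{
		Certificates:       []tls.Certificate{cert},
		InsecureSkipVerify: true, // mirrors the scrape config
		// Offer both protocols, as curl does. With HTTP/2 disabled, a Go
		// HTTP client would effectively offer only http/1.1 here.
		NextProtos: []string{"h2", "http/1.1"},
	})
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("negotiated ALPN protocol:", conn.ConnectionState().NegotiatedProtocol)
}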

Relevant curl output:

root@sleep-5f98748557-s4wh5:/# curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure -v
*   Trying 10.244.3.10:9102...
* TCP_NODELAY set
* Connected to 10.244.3.10 (10.244.3.10) port 9102 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/istio-certs/root-cert.pem
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: [NONE]
*  start date: Jul  7 20:21:33 2021 GMT
*  expire date: Jul  8 20:21:33 2021 GMT
*  issuer: O=cluster.local
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x564d80d81e10)
> GET /metrics HTTP/2
> Host: 10.244.3.10:9102
> user-agent: curl/7.68.0
> accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 2147483647)!
< HTTP/2 200
@LeviHarrison (Member):

As you've found, the v2.21.0 changelog only contains two changes that could affect this: the move to Go 1.15 (and, as a result, the deprecation of CommonName matching) and the disabling of HTTP/2, which remains disabled to this day. Through your testing, I think we can rule out the CommonName issue, which leaves the latter.

I wonder what would happen if you removed config_util.WithHTTP2Disabled() from these lines and rebuilt the container (https://github.com/prometheus/prometheus#building-the-docker-image).

client, err := config_util.NewClientFromConfig(cfg.HTTPClientConfig, cfg.JobName, config_util.WithHTTP2Disabled())

client, err := config_util.NewClientFromConfig(cfg.HTTPClientConfig, cfg.JobName, config_util.WithHTTP2Disabled())
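
For background, removing that option matters because disabling HTTP/2 on a Go http.Transport is conventionally done by installing a non-nil, empty TLSNextProto map - roughly this idea (a simplified sketch, not the exact prometheus/common code):

package main

import (
	"crypto/tls"
	"net/http"
)

// newTransport returns an *http.Transport with HTTP/2 on or off.
// Simplified sketch - not the exact prometheus/common implementation.
func newTransport(disableHTTP2 bool) *http.Transport {
	t := &http.Transport{ForceAttemptHTTP2: !disableHTTP2}
	if disableHTTP2 {
		// A non-nil, empty map tells net/http not to negotiate h2 via ALPN.
		t.TLSNextProto = map[string]func(string, *tls.Conn) http.RoundTripper{}
	}
	return t
}

func main() {
	client := &http.Client{Transport: newTransport(false)} // HTTP/2 enabled
	_ = client
}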

@tillig (Author) commented Jul 8, 2021

It's a great question, but I don't have a Go environment set up at the moment. I'll have to set something up and give it a try, though I'm not sure how quickly that'll happen. I'll do my best.

@LeviHarrison (Member):

Ah, no worries - if you asked me about something .NET, I'd be in the same boat. I'd offer to build the image for you, but that would raise security concerns.

@roidelapluie (Member):

HTTP/2 was disabled because of multiple bugs that were harming our users. I see that at least one of them is still open: golang/go#32388.

@LeviHarrison (Member):

Yes, but possibly in this case not having it is also causing inconvenience. Maybe we should consider adding an optional flag as an experimental feature.

@roidelapluie (Member):

> Yes, but possibly in this case not having it is also causing inconvenience. Maybe we should consider adding an optional flag as an experimental feature.

Yeah, I will revive prometheus/common#286.

@tillig (Author) commented Jul 9, 2021

Working on building tag v2.28.1 with Go 1.16.5 on Mac. I have the gnu-tar package installed as noted in the README, and I'm following the Docker build instructions.

The promu crossbuild -p linux/amd64 command fails after downloading all the dependencies.

... [truncated huge list of downloads] ...
go: downloading github.com/PuerkitoBio/urlesc v0.0.0-20170810143723-de5bf2ad4578
go: downloading github.com/Azure/go-autorest/autorest/validation v0.3.1
go: downloading github.com/Azure/go-autorest/autorest/to v0.4.0
go build github.com/aws/aws-sdk-go/service/ec2: /usr/local/go/pkg/tool/linux_amd64/compile: signal: killed
!! command failed: build -o .build/linux-amd64/prometheus -ldflags -X github.com/prometheus/common/version.Version=2.28.1 -X github.com/prometheus/common/version.Revision=b0944590a1c9a6b35dc5a696869f75f422b107a1 -X github.com/prometheus/common/version.Branch=HEAD -X github.com/prometheus/common/version.BuildUser=root@76a91e410d00 -X github.com/prometheus/common/version.BuildDate=20210709-14:47:03  -extldflags '-static' -a -tags netgo,builtinassets github.com/prometheus/prometheus/cmd/prometheus: exit status 1
make: *** [Makefile.common:227: common-build] Error 1
!! The base builder docker image exited unexpectedly: exit status 2

I'll see if I can continue troubleshooting the build, but that's where I am as far as trying to get a custom version up and running and tested.

@tillig (Author) commented Jul 9, 2021

It seems to be something about building linux/amd64 on Mac - I've got Intel hardware, not Apple silicon, but still. Looking in .promu.yml, only linux is listed as a target, so I tried promu crossbuild -p linux and got a little further, but when the build hit the point where it specifically executed against linux/amd64, it failed the same way.

go: downloading github.com/PuerkitoBio/urlesc v0.0.0-20170810143723-de5bf2ad4578
go: downloading github.com/Azure/go-autorest/autorest/validation v0.3.1
go: downloading github.com/Azure/go-autorest/autorest/to v0.4.0
 >   promtool
go: downloading github.com/google/pprof v0.0.0-20210609004039-a478d1d731e9
# linux-amd64
>> writing assets
# Un-setting GOOS and GOARCH here because the generated Go code is always the same,
# but the cached object code is incompatible between architectures and OSes (which
# breaks cross-building for different combinations on CI in the same container).
cd web/ui && GO111MODULE=on GOOS= GOARCH= go generate -x -v
doc.go
go run assets_generate.go
ui.go
>> building binaries
GO111MODULE=on /go/bin/promu build --prefix .build/linux-amd64
 >   prometheus
 >   promtool
go build github.com/aws/aws-sdk-go/service/ec2: /usr/local/go/pkg/tool/linux_amd64/compile: signal: killed
!! command failed: build -o .build/linux-amd64/promtool -ldflags -X github.com/prometheus/common/version.Version=2.28.0 -X github.com/prometheus/common/version.Revision=dc8f50559534a0820823c27ded103f3cee4b2af4 -X github.com/prometheus/common/version.Branch=main -X github.com/prometheus/common/version.BuildUser=root@77e0bfa6da0f -X github.com/prometheus/common/version.BuildDate=20210709-17:28:48  -extldflags '-static' -a -tags netgo,builtinassets github.com/prometheus/prometheus/cmd/promtool: exit status 1
make: *** [Makefile.common:236: common-build] Error 1
!! The base builder docker image exited unexpectedly: exit status 2

I guess I'll see if I can get a Linux VM on Azure or something to try building with.

@LeviHarrison (Member):

So sorry about this - I've hit similar build issues before on Mac. I can get a Linux binary built through the project's CI system, and then hopefully it will be a simpler process to just add it to a Docker image.

@tillig (Author) commented Jul 9, 2021

OK, it took a bit but I was able to build a custom Prometheus container using a Linux VM.

I built based on the current main. I updated scrape/scrape.go as instructed:

tillig@tillig-prom-build:~/go/src/github.com/prometheus/prometheus$ cat scrape/scrape.go | grep -n NewClientFromConfig
271:	client, err := config_util.NewClientFromConfig(cfg.HTTPClientConfig, cfg.JobName)
378:	client, err := config_util.NewClientFromConfig(cfg.HTTPClientConfig, cfg.JobName)

I then deployed my custom container and I still get connection reset by peer.

I backed it out and deployed the official v2.20.1 container just in case and, sure enough, it started working again.

Maybe those two lines aren't enough to turn HTTP/2 back on... or maybe it's something else?

@LeviHarrison (Member) commented Jul 9, 2021

As long as the error message is still coming from the same line (scrape.go:1236), I think sadly HTTP/2 is not the issue. Sorry to put you through all this trouble.

Besides that, I don't really see anything else on the changelog that could have affected it, except maybe the move to Go 1.15. If it's not too much trouble, maybe try building with Go 1.14?

In .promu.yml, change the Go version from 1.16 to 1.14. I just built it and everything seems to work fine.
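
(For reference, the version lives under the go key in .promu.yml - roughly this fragment, other fields omitted:)

go:
    version: 1.14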

@LeviHarrison (Member):

Actually, maybe not - building with promu -p linux/amd64 seems to return some React errors.

@tillig (Author) commented Jul 9, 2021

Yup, the error is exactly the same line - scrape.go:1236. For grins I tried putting that GODEBUG setting for the common name back in, just in case it was a double whammy - no luck, still connection reset by peer. 🤔 I have to admit protocol troubleshooting is not my strong suit. I'm hoping I don't need to dive into some sort of tcpdump thing, but maybe that's where we end up.

If necessary, I can provide some scripts to help deploy Istio and Prometheus in a configuration for testing.

@LeviHarrison (Member):

That would be great, thanks!

I did just write up this Dockerfile for building on Go 1.14 that does work, so that might be worth a shot:

FROM golang:1.14

WORKDIR /go/src/prometheus
COPY . .

RUN go get -v ./...
RUN go install -v ./...

ENTRYPOINT ["./prometheus"]
CMD [ "--config.file=/etc/prometheus/prometheus.yml", \
      "--storage.tsdb.path=/prometheus", \
      "--web.console.libraries=/usr/share/prometheus/console_libraries", \
      "--web.console.templates=/usr/share/prometheus/consoles" ]

@tillig (Author) commented Jul 9, 2021

I created a repo over here with some scripts and config to set up a barebones cluster with just Istio and Prometheus to demonstrate the issue. The only thing I didn't provide was a test app that you can configure to scrape; if I need to make something I can, but I figured you likely have something. Let me know if something doesn't work; I set up a whole fresh cluster and verified it, but I'm on Mac so there may be little bash-isms or something that I got wrong.

@LeviHarrison (Member):

Thank you so much, that looks great!

@tillig (Author) commented Jul 9, 2021

I'm changing my mind - now I think enabling HTTP/2 does fix it.

In making that repro, I created a whole new cluster with just Istio and Prometheus. Deploy v2.20.1 - scrapes fine. v2.28.1 - fails. Deploy my main-with-http2 container that I'd built in my VM... it works.

Current hypothesis is that the imagePullPolicy was getting me - a cached version of the container was being used instead of my freshly built one with the changes. Either that, or Istio was having some trouble propagating policy around to the Envoy proxies (some sort of eventual-consistency problem). In any case, I'm seeing green lights right now with my custom container.

This was main built using Go 1.16.5 on Ubuntu 20.04.
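
(If the pull policy is the culprit, forcing pulls in the pod spec would rule that out - a minimal sketch, with a hypothetical image name:)

containers:
- name: prometheus
  image: example.azurecr.io/prometheus:custom  # hypothetical custom build
  imagePullPolicy: Always                      # never reuse a cached image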

@LeviHarrison (Member):

That's a relief - glad it's working for you now. For the time being, I guess you could use that custom container; it should be stable enough... maybe? You may want to make the same change on the v2.28.1 release tag just to be safe, although I don't think anything big has been merged since.

@roidelapluie and I have been talking about ways to re-enable HTTP/2 and hopefully it will be out in the next release.

@LeviHarrison (Member):

Weird, I set up Istio with the scripts you provided and a demo app, and scraping is working perfectly.

[screenshot: Prometheus targets page showing the demo app being scraped successfully]

@tillig (Author) commented Jul 9, 2021

Your app is in the default namespace, not the test-app namespace. It's being scraped as a non-Istio microservice because the sidecar isn't injected. If you deploy to the test-app namespace you will see a ton more labels and it'll be scraped by a different job.

@LeviHarrison (Member):

Ahhhhhhhhh 🤦🏻‍♂️ I had changed the namespace in every single config except the one that worked.

Get "https://10.1.67.80:8080/metrics": read tcp 10.1.67.91:41460->10.1.67.80:8080: read: connection reset by peer

@LeviHarrison (Member) commented Jul 9, 2021

I can confirm that enabling HTTP/2 does resolve the problem. Thanks for the easy setup!

@tillig (Author) commented Jul 9, 2021

Anytime! I'm glad I could help figure it out... and that it wasn't just me! 😆 Thanks for taking the time to look into it, I really do appreciate it.

@LeviHarrison (Member) commented Jul 9, 2021

It was fun! I didn't know anything about Istio until now.

@roidelapluie (Member):

To get to the bottom of this, @LeviHarrison, could you provide some tcpdumps with and without HTTP/2 in Prometheus? Thanks!
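
(Something along these lines should do it - a sketch, assuming the capture runs somewhere the scrape traffic is visible and using the target address from this thread:)

tcpdump -i any -w scrape.pcap host 10.244.3.10 and port 9102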

@LeviHarrison (Member) commented Aug 9, 2021

Here are two captures, with and without HTTP/2. Hopefully they have all the information needed; if not, please let me know.

tcpdumps.zip

@TamasNeumer:

@tillig

Thanks for investigating this issue. I'm currently also working on enabling mTLS for our monitoring stack (kube-prometheus).

I was wondering: did you focus only on achieving mTLS while scraping with Prometheus, or did you also manage to get mTLS between Prometheus and Alertmanager?

@tillig (Author) commented Aug 30, 2021

I don't have mTLS working with anything at the moment, so this would be a question for the Prometheus team. Also, I'm not using Alertmanager, so even when I do get things working I still won't have that answer.

@sakajunquality:

@tillig any updates on this?

@tillig (Author) commented Sep 21, 2021

@sakajunquality Not sure why I'd have any updates on it - I researched it but I'm not doing the coding.

@roidelapluie (Member):

This will work in Prometheus 2.31 without any workarounds.

roidelapluie added a commit that referenced this issue Sep 26, 2021

We are re-enabling HTTP/2 again. There have been a few bugfixes upstream in Go, and we have also enabled ReadIdleTimeout.

Fix #7588
Fix #9068

Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
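
For reference, ReadIdleTimeout is the golang.org/x/net/http2 health-check setting that lets a client send PINGs on idle connections and drop dead ones - the class of stuck-connection bugs that originally led to HTTP/2 being disabled. A minimal sketch of wiring it up on a standard transport (illustrative values, not the exact prometheus/common code):

package main

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	transport := &http.Transport{ForceAttemptHTTP2: true}
	// ConfigureTransports registers HTTP/2 on the HTTP/1 transport and
	// returns the *http2.Transport so its settings can be tuned directly.
	h2, err := http2.ConfigureTransports(transport)
	if err != nil {
		panic(err)
	}
	// If a connection has been idle this long, send an HTTP/2 PING; if the
	// peer doesn't answer within PingTimeout, close the connection rather
	// than letting requests hang on it.
	h2.ReadIdleTimeout = 10 * time.Second
	h2.PingTimeout = 5 * time.Second

	client := &http.Client{Transport: transport}
	_ = client // use client for requests as usual
}
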
@Sunil-Jacob:

@tillig I'm facing the same issue: in a namespace with the label istio-injection=enabled and mTLS on, Prometheus fails to scrape. I'm currently using Istio 1.11.1 and Prometheus 2.26.0. Were you able to solve the problem?

@tillig (Author) commented Oct 22, 2021

I have not revisited scraping metrics using mTLS yet. However, Istio introduced metrics merging to solve some of this; I'd recommend checking it out: https://istio.io/latest/docs/ops/integrations/prometheus/#option-1-metrics-merging

@roidelapluie (Member):

We are releasing Prometheus 2.31.0-rc.0 today, which will fix the issues with Istio.

@Sunil-Jacob:

> I have not revisited scraping metrics using mTLS yet. However, Istio introduced metrics merging to solve some of this; I'd recommend checking it out: https://istio.io/latest/docs/ops/integrations/prometheus/#option-1-metrics-merging

Sure, I will check it out.

@roidelapluie (Member):

Prometheus 2.31 is released and it should work directly here.

@Sunil-Jacob:

> Prometheus 2.31 is released and it should work directly here.

Thanks, I will install this.

This issue was locked as resolved and limited to collaborators on May 10, 2022.