
Ingress controller load balancer can not connect to nodes #32556

Open
2 of 3 tasks
carlosrejano opened this issue May 15, 2024 · 4 comments
Labels
feature/k8s-ingress · info-completed (The GH issue has received a reply from the author) · kind/bug (This is a bug in the Cilium logic.) · kind/community-report (This was reported by a user in the Cilium community, eg via Slack.) · needs/triage (This issue requires triaging to establish severity and next steps.) · sig/agent (Cilium agent related.)

Comments

carlosrejano commented May 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

We have an EKS cluster where we are trying to use the Cilium ingress controller, and the load balancer created for the Ingress cannot always connect to the nodes.

What we see is that the load balancer can connect to some nodes for periods of time, but the behavior is not consistent, and there is no pattern between the nodes it can connect to and the ones it cannot.

Connecting directly from the nodes to the NodePort opened for the load balancer does not work either, so it should not be a security group problem; in any case we tried allowing traffic from every internal address and nothing changed. Some nodes work and others do not, and sometimes no nodes are reachable by the load balancer at all.
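(A minimal sketch of the kind of check described above, run from another host inside the VPC; the node IP and NodePort are the ones shown below, everything else is illustrative:)

    # probe one node's internal IP on the NodePort the load balancer targets
    curl -v --connect-timeout 5 http://10.218.248.217:31799/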

I checked, and all the nodes have this Cilium LB configuration for the NodePort:

10.218.248.217:31799   0.0.0.0:0 (331) (0) [NodePort, l7-load-balancer]
0.0.0.0:31799          0.0.0.0:0 (333) (0) [NodePort, non-routable, l7-load-balancer]
10.0.243.9:31799       0.0.0.0:0 (330) (0) [NodePort, l7-load-balancer]
169.254.0.11:31799     0.0.0.0:0 (332) (0) [NodePort, l7-load-balancer]
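(For reference, output like the above can typically be obtained from inside a Cilium agent pod; the namespace and DaemonSet name below are assumptions based on a default install:)

    kubectl -n kube-system exec ds/cilium -- cilium service list
    # raw BPF view of the same load-balancing entries
    kubectl -n kube-system exec ds/cilium -- cilium bpf lb list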

Configuration values used:

cni:
  configMap: cni-config
  customConf: true
eni:
  enabled: true
  updateEC2AdapterLimitViaAPI: true
  awsEnablePrefixDelegation: true
  awsReleaseExcessIPs: true
egressMasqueradeInterfaces: eth0
policyEnforcementMode: "never"
ipam:
  mode: eni
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
tunnelProtocol: ""
nodePort:
  enabled: true
nodeinit:
  enabled: true
ingressController:
  enabled: true

cni-config configmap values:

    {
      "cniVersion":"0.3.1",
      "name":"cilium",
      "plugins": [
        {
          "cniVersion":"0.3.1",
          "type":"cilium-cni",
          "eni": {
            "subnet-ids": ["subnet-xxxxxx", "subnet-xxxxxx", "subnet-xxxxxxx"],
            "first-interface-index": 1
          }
        }
      ]
    }
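(For context, a values file like the one above would typically be applied with Helm; the release name and values file name below are assumptions:)

    helm repo add cilium https://helm.cilium.io/
    helm upgrade --install cilium cilium/cilium --namespace kube-system -f values.yaml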

Cilium Version

We tried it with multiple versions:

  • 1.14.1
  • 1.15.4
  • 1.16.0-pre.0

Kernel Version

Linux 5.10.215-203.850.amzn2.aarch64

Kubernetes Version

v1.26.15

Regression

No response

Sysdump

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@carlosrejano carlosrejano added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 15, 2024
squeed (Contributor) commented May 16, 2024

Hi there, thanks for the bug report. It's not yet clear to me how exactly traffic is flowing. Could you outline the expected traffic flow, and indicate where you think it is failing?

In particular, I suggest the section on troubleshooting with hubble to identify where packets are being dropped. Can you go through the troubleshooting section and clarify the problem a bit?

Thanks.
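(As an illustrative sketch of that suggestion, a drop-focused Hubble query for the NodePort from this report could look like the following; the flags assume a reasonably recent Hubble CLI, and the namespace/DaemonSet name in the second form are assumptions:)

    hubble observe --verdict DROPPED --port 31799 --follow
    # or, without a local hubble binary, from inside the agent pod:
    kubectl -n kube-system exec ds/cilium -- hubble observe --verdict DROPPED --port 31799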

@squeed squeed added the need-more-info More information is required to further debug or fix the issue. label May 16, 2024
sayboras (Member) commented

Also can you share your cilium configmap as well? Thanks.

carlosrejano (Author) commented May 23, 2024

> Hi there, thanks for the bug report. It's not yet clear to me how exactly traffic is flowing. Could you outline the expected traffic flow, and indicate where you think it is failing?
>
> In particular, I suggest the section on troubleshooting with hubble to identify where packets are being dropped. Can you go through the troubleshooting section and clarify the problem a bit?
>
> Thanks.

@squeed
Hi, sorry for the delay; let me explain it better (correct me if I get something wrong). The idea is to use Cilium as an Ingress Controller: when I create an Ingress object, it creates the AWS load balancer (I tried both Classic LB and NLB), which balances the traffic to the Cilium ingress controller. If I'm not wrong, the Cilium component that handles the traffic coming from the LB is cilium-envoy, which in my case runs inside cilium-agent. After the traffic arrives at cilium-envoy, it gets sent to the relevant backend of the Ingress. My problem is the communication between the load balancer and Envoy: the load balancer cannot reach Envoy most of the time.
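(Roughly the kind of Ingress object described above; the object name, backend Service, and NLB annotation are illustrative assumptions, with the annotation relying on service.beta.kubernetes.io being listed in ingress-lb-annotation-prefixes so it is propagated to the generated LoadBalancer Service:)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-ingress                  # hypothetical name
      annotations:
        # assumed annotation to request an NLB instead of a Classic LB
        service.beta.kubernetes.io/aws-load-balancer-type: nlb
    spec:
      ingressClassName: cilium
      rules:
        - http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: example-backend    # hypothetical backend Service
                    port:
                      number: 80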

Please ask any other questions you need if I still haven't explained it well enough.

Thanks for looking into this!

@github-actions github-actions bot added info-completed The GH issue has received a reply from the author and removed need-more-info More information is required to further debug or fix the issue. labels May 23, 2024
carlosrejano (Author) commented

> Also can you share your cilium configmap as well? Thanks.

@sayboras Yes, here it is:

  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  arping-refresh-period: 30s
  auto-direct-node-routes: "false"
  bpf-lb-acceleration: disabled
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-lb-sock: "false"
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  bpf-root: /sys/fs/bpf
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "0"
  cluster-name: default
  cluster-pool-ipv4-cidr: 10.0.0.0/8
  cluster-pool-ipv4-mask-size: "24"
  cni-chaining-mode: aws-cni
  cni-exclusive: "false"
  cni-log-file: /var/run/cilium/cilium-cni.log
  custom-cni-conf: "false"
  debug: "false"
  debug-verbose: ""
  egress-gateway-reconciliation-trigger-interval: 1s
  enable-auto-protect-node-port-range: "true"
  enable-bgp-control-plane: "false"
  enable-bpf-clock-probe: "false"
  enable-endpoint-health-checking: "false"
  enable-endpoint-routes: "true"
  enable-envoy-config: "true"
  enable-external-ips: "false"
  enable-gateway-api: "true"
  enable-gateway-api-secrets-sync: "true"
  enable-health-check-loadbalancer-ip: "false"
  enable-health-check-nodeport: "true"
  enable-health-checking: "true"
  enable-host-legacy-routing: "true"
  enable-host-port: "false"
  enable-hubble: "true"
  enable-ingress-controller: "true"
  enable-ingress-proxy-protocol: "false"
  enable-ingress-secrets-sync: "true"
  enable-ipv4: "true"
  enable-ipv4-big-tcp: "false"
  enable-ipv4-masquerade: "false"
  enable-ipv6: "false"
  enable-ipv6-big-tcp: "false"
  enable-ipv6-masquerade: "true"
  enable-k8s-networkpolicy: "true"
  enable-k8s-terminating-endpoint: "true"
  enable-l2-neigh-discovery: "true"
  enable-l7-proxy: "true"
  enable-local-node-route: "false"
  enable-local-redirect-policy: "false"
  enable-masquerade-to-route-source: "false"
  enable-metrics: "true"
  enable-node-port: "true"
  enable-policy: never
  enable-remote-node-identity: "true"
  enable-sctp: "false"
  enable-svc-source-range-check: "true"
  enable-vtep: "false"
  enable-well-known-identities: "false"
  enable-xt-socket-fallback: "true"
  enforce-ingress-https: "true"
  external-envoy-proxy: "false"
  gateway-api-secrets-namespace: cilium-secrets
  hubble-disable-tls: "false"
  hubble-export-file-max-backups: "5"
  hubble-export-file-max-size-mb: "10"
  hubble-listen-address: :4244
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
  identity-allocation-mode: crd
  identity-gc-interval: 15m0s
  identity-heartbeat-timeout: 30m0s
  ingress-default-lb-mode: dedicated
  ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io
    cloud.google.com
  ingress-secrets-namespace: cilium-secrets
  ingress-shared-lb-service-name: cilium-ingress
  install-no-conntrack-iptables-rules: "false"
  ipam: cluster-pool
  ipam-cilium-node-update-rate: 15s
  k8s-client-burst: "10"
  k8s-client-qps: "5"
  kube-proxy-replacement: "false"
  kube-proxy-replacement-healthz-bind-address: ""
  max-connected-clusters: "255"
  mesh-auth-enabled: "true"
  mesh-auth-gc-interval: 5m0s
  mesh-auth-queue-size: "1024"
  mesh-auth-rotated-identities-queue-size: "1024"
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  node-port-bind-protection: "true"
  nodes-gc-interval: 5m0s
  operator-api-serve-addr: 127.0.0.1:9234
  operator-prometheus-serve-addr: :9963
  policy-cidr-match-mode: ""
  preallocate-bpf-maps: "false"
  procfs: /host/proc
  proxy-connect-timeout: "2"
  proxy-idle-timeout-seconds: "60"
  proxy-max-connection-duration-seconds: "0"
  proxy-max-requests-per-connection: "0"
  proxy-prometheus-port: "9964"
  proxy-xff-num-trusted-hops-egress: "0"
  proxy-xff-num-trusted-hops-ingress: "0"
  remove-cilium-node-taints: "true"
  routing-mode: native
  service-no-backend-response: reject
  set-cilium-is-up-condition: "true"
  set-cilium-node-taints: "true"
  sidecar-istio-proxy-image: cilium/istio_proxy
  skip-cnp-status-startup-clean: "false"
  synchronize-k8s-nodes: "true"
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-dns-compression: "true"
  tofqdns-endpoint-max-ip-per-hostname: "50"
  tofqdns-idle-connection-grace-period: 0s
  tofqdns-max-deferred-connection-deletes: "10000"
  tofqdns-proxy-response-max-delay: 100ms
  unmanaged-pod-watcher-interval: "15"
  vtep-cidr: ""
  vtep-endpoint: ""
  vtep-mac: ""
  vtep-mask: ""
  write-cni-conf-when-ready: /host/etc/cni/net.d/05-cilium.conflist
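(For completeness, this ConfigMap can be dumped directly from the cluster; the default name and namespace are assumed here:)

    kubectl -n kube-system get configmap cilium-config -o yaml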

Thank you for looking into this!

@lmb lmb added sig/agent Cilium agent related. feature/k8s-ingress labels May 24, 2024