Cilium connectivity test using external DNS lookups fails when BPF masquerade is enabled in native routing mode #32559

Open
3 tasks done
jspaleta opened this issue May 15, 2024 · 1 comment
Labels
kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.

Comments

@jspaleta
Contributor

jspaleta commented May 15, 2024

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

While doing due diligence for issue #32525, I ran into a reproducible connectivity test failure involving external DNS lookups, using the same baseline environment: native routing with BPF masquerade enabled.

Cilium Version

Cilium 1.15.5 and 1.15.4 have been tested and both show reproducible connectivity test failures.
Cilium 1.14.10 has also been tested and shows connectivity test failures as well.

Kernel Version

Linux localhost 6.7.4-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Feb 5 22:21:14 UTC 2024 x86_64 GNU/Linux

Kubernetes Version

Kind cluster using:
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.2

Regression

Does not appear to be a regression; 1.14.10 also fails reproducibly for me.

Sysdump

1.15.5 sysdump from first failed action for --test='client-egress-l7-named-port/pod-to-world/*'
cilium-sysdump-20240515-112214.zip

Relevant log output

$ cilium connectivity test --verbose --test='client-egress-l7-named-port/pod-to-world/*' --collect-sysdump-on-failure
...
📋 Test Report
❌ 1/1 tests failed (1/9 actions), 77 tests skipped, 1 scenarios skipped:
Test [client-egress-l7-named-port]:
  ❌ client-egress-l7-named-port/pod-to-world/http-to-one.one.one.one.-0: cilium-test/client2-ccd7b8bdf-mnnck (10.9.0.49) -> one.one.one.one.-http (one.one.one.one.:80)
connectivity test failed: 1 tests failed

Anything else?

I'm using the same environment I used in #32525.

Failing Cilium config on 1.15.4 and 1.15.5:

## baseline
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: '10.9.0.0/16'
autoDirectNodeRoutes: true
ingressController:
  # -- Enable cilium ingress controller
  enabled: true
  default: true
  loadbalancerMode: dedicated
gatewayAPI:
  enabled: true
operator:
  replicas: 1
l2announcements:
  enabled: true
ipam:
  mode: 'cluster-pool'
  operator:
    clusterPoolIPv4PodCIDRList:
      - '10.9.0.0/16'
    clusterPoolIPv4MaskSize: 24

## Under test  
bpf:
  masquerade: true
  legacyHostRouting: true

Passing Cilium config on 1.15.4 and 1.15.5:

## baseline
kubeProxyReplacement: true
routingMode: native
ipv4NativeRoutingCIDR: '10.9.0.0/16'
autoDirectNodeRoutes: true
ingressController:
  # -- Enable cilium ingress controller
  enabled: true
  default: true
  loadbalancerMode: dedicated
gatewayAPI:
  enabled: true
operator:
  replicas: 1
l2announcements:
  enabled: true
ipam:
  mode: 'cluster-pool'
  operator:
    clusterPoolIPv4PodCIDRList:
      - '10.9.0.0/16'
    clusterPoolIPv4MaskSize: 24

## Under test  
bpf:
  masquerade: false
  legacyHostRouting: true

The bpf.legacyHostRouting option value has no impact on the 1.15.4 or 1.15.5 test results.

The config that fails for 1.15.4 and 1.15.5 above works in cilium 1.14.10

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jspaleta jspaleta added kind/bug This is a bug in the Cilium logic. needs/triage This issue requires triaging to establish severity and next steps. kind/community-report This was reported by a user in the Cilium community, eg via Slack. kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. and removed kind/regression This functionality worked fine before, but was broken in a newer release of Cilium. labels May 15, 2024
@jspaleta
Contributor Author

jspaleta commented May 15, 2024

(Updated)
I originally reported this as a regression relative to 1.14.10, but that was wrong.

Clean retests on 1.14.10 give behavior consistent with the tests on 1.15.4 and 1.15.5.

Summary of the test matrix, assuming the common baseline native routing configuration provided above:

## Under Test

## fails:
#bpf:
#  masquerade: true 
#  legacyHostRouting: true

#bpf:
#  masquerade: true 
#  legacyHostRouting: false

## passes:
#bpf:
#  masquerade: false
#  legacyHostRouting: false

#bpf:
#  masquerade: false
#  legacyHostRouting: true
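
The matrix above can be restated as a small script. This is a minimal sketch and not part of the original report: the helper names (`helm_overrides`, `matrix`, `depends_only_on_masquerade`) are hypothetical, the outcomes are copied from the matrix, and the Helm keys are written exactly as they appear in the values fragments in this issue. It only builds `--set` arguments and prints them; it does not run Helm or the connectivity test.

```python
from itertools import product

# Outcomes copied from the matrix in this comment,
# keyed by (masquerade, legacyHostRouting).
REPORTED = {
    (True, True): "fails",
    (True, False): "fails",
    (False, False): "passes",
    (False, True): "passes",
}


def helm_overrides(masquerade, legacy_host_routing):
    """Build Helm --set arguments for one cell of the matrix.

    Key names are copied as written in this issue's values fragments.
    """
    def as_yaml_bool(value):
        return "true" if value else "false"

    return [
        "--set", f"bpf.masquerade={as_yaml_bool(masquerade)}",
        "--set", f"bpf.legacyHostRouting={as_yaml_bool(legacy_host_routing)}",
    ]


def matrix():
    """Return every flag combination together with its reported outcome."""
    return [
        {
            "masquerade": masq,
            "legacyHostRouting": legacy,
            "overrides": helm_overrides(masq, legacy),
            "reported": REPORTED[(masq, legacy)],
        }
        for masq, legacy in product([True, False], repeat=2)
    ]


def depends_only_on_masquerade():
    """True if flipping legacyHostRouting never changes the reported outcome."""
    return all(
        REPORTED[(masq, legacy)] == REPORTED[(masq, not legacy)]
        for masq, legacy in REPORTED
    )


if __name__ == "__main__":
    for row in matrix():
        print(" ".join(row["overrides"]), "->", row["reported"])
    print("outcome depends only on bpf.masquerade:", depends_only_on_masquerade())
```

With the results reported here, the outcome tracks `bpf.masquerade` alone, which matches the observation that the `bpf.legacyHostRouting` value has no effect on the test result.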

@squeed squeed removed the kind/community-report This was reported by a user in the Cilium community, eg via Slack. label May 16, 2024
@lmb lmb added the sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label May 21, 2024