Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kube-proxy: Optionally do privileged configs only #120864

Merged
merged 1 commit into from Oct 25, 2023

Conversation

uablrek
Copy link
Contributor

@uablrek uablrek commented Sep 25, 2023

A new --init-only flag is added that makes kube-proxy perform configuration that requires privileged mode and exit. It is intended to be executed in a privileged initContainer, while the main container may run with a stricter securityContext

What type of PR is this?

/kind feature
/sig network
/area kube-proxy

What this PR does / why we need it:

Users don't like containers running with privileged: true. This PR allows the configs that requires privileged: true to run in an initContainer.

Which issue(s) this PR fixes:

Fixes #112171

Well, it solves the problem, but nothing documented (yet)

Special notes for your reviewer:

Tested with an initContainer similar to

initContainers:
  - name: kube-proxy-init
    command:
    - /usr/local/bin/kube-proxy
    - --config=/var/lib/kube-proxy/config.conf
    - --init-only
    image: registry.k8s.io/kube-proxy:local
    imagePullPolicy: IfNotPresent
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /var/lib/kube-proxy
      name: kube-proxy

And the main kube-proxy container runs with:

        securityContext:
          capabilities:
            add: ["NET_ADMIN", "SYS_RESOURCE"]

Does this PR introduce a user-facing change?

Added a new `--init-only` command line flag to `kube-proxy`. Setting the flag makes `kube-proxy` perform its initial configuration that requires privileged mode, and then exit. The `--init-only` mode is intended to be executed in a privileged init container, so that the main container may run with a stricter `securityContext`.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/network Categorizes an issue or PR as relevant to SIG Network. area/kube-proxy needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 25, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/ipvs labels Sep 25, 2023
@uablrek
Copy link
Contributor Author

uablrek commented Sep 25, 2023

/cc @aojea @ialidzhikov @danwinship

This should be possible to test in kind, but I don't know how to alter the kube-proxy manifest (I can create an image with my kube-proxy though).

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 25, 2023
@uablrek
Copy link
Contributor Author

uablrek commented Sep 25, 2023

What is pkg/proxy/kubemark/hollow_proxy.go? Is it used?

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Sep 25, 2023
@uablrek
Copy link
Contributor Author

uablrek commented Sep 25, 2023

/retest

@danwinship
Copy link
Contributor

What is pkg/proxy/kubemark/hollow_proxy.go? Is it used?

It's part of cmd/kubemark, which is some benchmarking thing

Copy link
Contributor

@danwinship danwinship left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like it would be cleaner to split out a PrivilegedInit method in each backend instead, and call that from platformSetup.

Then it's not relying on the code being properly ordered in NewProxier (because there's a bunch of unprivileged stuff at the top of the iptables NewProxier code currently...)

func (s *ProxyServer) createProxier(config *proxyconfigapi.KubeProxyConfiguration, dualStackMode, initOnly bool) (proxy.Provider, error) {
if initOnly {
// --init-only is not implemented on Windows
return nil, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be an error then rather than nil?

I guess that depends on whether it's possible to run kube-proxy in winkernel mode in a non-privileged pod, which I don't know. (Does windows kubernetes even have any equivalent of capabilities?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayunit100 @jsturtevant can you help us to answer these questions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kube-proxy on windows needs to be run in privileged mode. either an error or log statement would be helpful here if someone specified this field. As it is now it would be confusing or lead to expectation that it could work as it silently runs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be an error then rather than nil?

No, the init can go wrong and return an error. But if it's ok it should return err==nil. But the returned proxier is nil to indicate that only initialization is done.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you mean on Windows (missed that). Then, yes it should. I will update.

@aojea
Copy link
Member

aojea commented Sep 25, 2023

This should be possible to test in kind, but I don't know how to alter the kube-proxy manifest (I can create an image with my kube-proxy though).

kubectl edit ds kube-proxy -n kube-system

@uablrek
Copy link
Contributor Author

uablrek commented Sep 25, 2023

It seems like it would be cleaner to split out a PrivilegedInit method in each backend instead, and call that from platformSetup.

That was also my first approach, and IMO it was not cleaner at all. First a number of parameters passed in "NewProxy" must be re-created and passed to the PrivilegedInit method. These are different between backends, and it became quite messy.

IMO the current approach, with some unnecessary internal stuff, e.g api-client and things, is better since you can clearly see that the old path is exactly the same if --init-only is not specified. This makes it 100% backward compatible and more robust since nothing is accidently lost, like the conntracker init (and some wait-for-config logic), which you get "for free".

This is now a size/M PR with trivial change in logic, and most updates are to pass the initOnly bool down the stack. I can almost guarantee that it's Ok, or at least that is doesn't mess up current functionality. I would be far less certain about the far larger update a separate PrivilegedInit method would require.

@uablrek
Copy link
Contributor Author

uablrek commented Sep 25, 2023

@danwinship Your comment on nil,nil return disappeared. I must have accidentally clicked the wrong button, sorry. I hope you at least got my reply

@uablrek
Copy link
Contributor Author

uablrek commented Sep 25, 2023

kubectl edit ds kube-proxy -n kube-system

Can I just do an "kubectl apply -f updated-kube-proxy.yaml"? It is less manual work, and I can attach the manifest in this PR

@danwinship
Copy link
Contributor

@danwinship Your comment on nil,nil return disappeared. I must have accidentally clicked the wrong button, sorry. I hope you at least got my reply

no, you must have clicked "Mark resolved" instead of "Submit"...

@uablrek
Copy link
Contributor Author

uablrek commented Sep 26, 2023

no, you must have clicked "Mark resolved" instead of "Submit"...

Yes. It showed up again. I answered again and clicked an "unresolve" button. I think it's ok now

@uablrek
Copy link
Contributor Author

uablrek commented Sep 26, 2023

/test pull-kubernetes-e2e-capz-windows-master

@uablrek
Copy link
Contributor Author

uablrek commented Sep 26, 2023

To test in KinD:

kind build node-image --image kindest/node:main --kube-root $GOPATH/src/k8s.io/kubernetes
kind create cluster ... --image kindest/node:main

Then update the kube-proxy manifest (never mind the warning):

kubectl apply -f updated-kube-proxy.yaml
updated-kube-proxy.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    k8s-app: kube-proxy
  name: kube-proxy
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: kube-proxy
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s-app: kube-proxy
    spec:
      initContainers:
      - command:
        - /usr/local/bin/kube-proxy
        - --config=/var/lib/kube-proxy/config.conf
        - --init-only
        image: registry.k8s.io/kube-proxy:v1.29.0-alpha.0.927_bab024befbfec3
        imagePullPolicy: IfNotPresent
        name: kube-proxy-init
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kube-proxy
          name: kube-proxy
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
      containers:
      - command:
        - /usr/local/bin/kube-proxy
        - --config=/var/lib/kube-proxy/config.conf
        - --hostname-override=$(NODE_NAME)
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: registry.k8s.io/kube-proxy:v1.29.0-alpha.0.927_bab024befbfec3
        imagePullPolicy: IfNotPresent
        name: kube-proxy
        resources: {}
        securityContext:
          capabilities:
            add: ["NET_ADMIN", "SYS_RESOURCE"]
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kube-proxy
          name: kube-proxy
        - mountPath: /run/xtables.lock
          name: xtables-lock
        - mountPath: /lib/modules
          name: lib-modules
          readOnly: true
      dnsPolicy: ClusterFirst
      hostNetwork: true
      nodeSelector:
        kubernetes.io/os: linux
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-proxy
      serviceAccountName: kube-proxy
      terminationGracePeriodSeconds: 30
      tolerations:
      - operator: Exists
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-proxy
        name: kube-proxy
      - hostPath:
          path: /run/xtables.lock
          type: FileOrCreate
        name: xtables-lock
      - hostPath:
          path: /lib/modules
          type: ""
        name: lib-modules
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate

Check logs:

> kubectl logs -n kube-system kube-proxy-vgqt8 -c kube-proxy-init
I0926 05:09:50.021445       1 server.go:1031] "Successfully retrieved node IP(s)" IPs=["172.18.0.4","fc00:f853:ccd:e793::4"]
I0926 05:09:50.035111       1 server.go:640] "kube-proxy running in dual-stack mode" primary ipFamily="IPv4"
I0926 05:09:50.042165       1 server_others.go:220] "Using ipvs Proxier"
I0926 05:09:50.042323       1 proxier.go:395] "System initiated and --init-only specified"
I0926 05:09:50.042403       1 proxier.go:395] "System initiated and --init-only specified"

@danwinship
Copy link
Contributor

And the main kube-proxy container runs with:

        securityContext:
          capabilities:
            add: ["NET_ADMIN", "SYS_RESOURCE"]

So this is the tricky bit: once we allow users to do this, it becomes a pretty-much-unbreakable compatibility constraint for point releases, and a very strong compatibility constraint even between major releases. So are we pretty sure that those two capabilities are enough and we won't need more in the future?

(What do we need SYS_RESOURCE for anyway? I would have thought NET_ADMIN would cover everything after startup...)

Also, we should probably have e2e coverage to ensure that we don't regress. Probably we could just change the existing e2e to use the privileged/unprivileged split, since, as you say, given the code organization, it's unlike that we would break things in the future in such a way that the split way kept working but the all-in-one way broke.

(Doing a small KEP to make sure we've figured out all the potential problems here would not be an awful idea...)

The kube-proxy configuration that require "privileged: true" can be done in an initContainer

gonna need a better release note than that... maybe even a blog post!

@uablrek
Copy link
Contributor Author

uablrek commented Oct 24, 2023

/retest

@@ -257,6 +258,11 @@ func NewProxier(ipFamily v1.IPFamily,
klog.InfoS("nf_conntrack_tcp_be_liberal set, not installing DROP rules for INVALID packets")
}

if initOnly {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, we only have to plumb the initOnly just for nf_conntrack_tcp_be_liberal and sysctlRouteLocalnet in the iptables mode :/

Copy link
Contributor Author

@uablrek uablrek Oct 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there is not, and shouldn't be, any difference between start with --init-only up until Run() is called. The rationale is here #120864 (comment).

@uablrek
Copy link
Contributor Author

uablrek commented Oct 25, 2023

@danwinship If it's ok, please lgtm this before the v1.29 code-freeze at Nov 1. I am sketching on a blog post. I think a non-privileged kube-proxy can be combined with User Namespaces for even better security, but I haven't tested yet

@uablrek
Copy link
Contributor Author

uablrek commented Oct 25, 2023

@uablrek
Copy link
Contributor Author

uablrek commented Oct 25, 2023

User Namespaces can't be used for PODs with hostNetwork: true and NET_ADMIN capabilities. First it is forbidden:

The DaemonSet "kube-proxy" is invalid: spec.template.spec.hostNetwork: Forbidden: when `pod.Spec.HostUsers` is false

But even if it was allowed, NET_ADMIN would be rejected as explained in #111090 (comment):

userns + hostNetwork + CAP_NET_ADMIN: this won't give you capabilities on the host network (this is a limitation imposed by the linux kernel on how userns work)

Please see the discussion on: https://kubernetes.slack.com/archives/C0BP8PW9G/p1698236915253799:

@sftim
Copy link
Contributor

sftim commented Oct 25, 2023

Changelog suggestion

Added a new `--init-only` command line flag to `kube-proxy`. Setting the flag makes `kube-proxy` perform its initial configuration that requires privileged mode, and then exit. The `--init-only` mode is intended to be executed in a privileged init container, so that the main container may run with a stricter `securityContext`.

@@ -108,6 +108,8 @@ type Options struct {
WriteConfigTo string
// CleanupAndExit, when true, makes the proxy server clean up iptables and ipvs rules, then exit.
CleanupAndExit bool
// InitAndExit, when true, makes the proxy server makes configurations that need privileged access, then exit.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: possibly typo here?

- InitAndExit, when true, makes the proxy server makes configurations that need privileged access, then exit.
+ InitAndExit, when true, makes the proxy server execute configurations that need privileged access, then exit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lars meant "it makes the proxy server make changes to the node configuration that need privileges", whereas I think you're interpreting it as "it makes the proxy server process the kube-proxy configuration options that need privileges".

Maybe just copy the text from the CLI flag below: "InitAndExit, when true, makes kube-proxy perform any initialization steps that must be done with full root privileges, and then exit"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it matters much. People that are interested in this function will get it anyway, so I let it be

@@ -168,7 +170,7 @@ func (o *Options) AddFlags(fs *pflag.FlagSet) {
"The purpose of this format is make sure you have the opportunity to notice if the next release hides additional metrics, "+
"rather than being surprised when they are permanently removed in the release after that. "+
"This parameter is ignored if a config file is specified by --config.")

fs.BoolVar(&o.InitAndExit, "init-only", o.InitAndExit, "If true, perform any initialization steps that must be done with full root privileges, and then exit. After doing this, you can run kube-proxy again with only the CAP_NET_ADMIN capability.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is mentioned in kubernetes/website#43680, but I wonder if it makes sense to mention here that the changes done by privileged initialization step will persist, and therefore running kube-proxy again without permissions will work? Something like:

- If true, perform any initialization steps that must be done with full root privileges,
+ If true, perform and persist any initialization steps that must be done with full root privileges,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as previous comment.

@@ -402,6 +403,11 @@ func NewProxier(ipFamily v1.IPFamily,
}
}

if initOnly {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to have a comment here (and in iptables/proxies.go) that anything privileged should come before this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it might, but I leave it for now just to get it done 😃

@danwinship
Copy link
Contributor

/lgtm
/approve
/hold
if you want to fix the comment text, un-hold if not

(I still think we should refactor proxy setup/startup to more explicitly separate the "needs root" and "doesn't need root" steps, but I've got some pending refactoring coming that would conflict with that anyway, so I'll just do that later.)

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 25, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 064eb3d26baacefc0d3175fe5c5896023c10658c

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, uablrek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 25, 2023
@uablrek
Copy link
Contributor Author

uablrek commented Oct 25, 2023

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 25, 2023
@k8s-ci-robot k8s-ci-robot merged commit 06a7feb into kubernetes:master Oct 25, 2023
15 of 16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.29 milestone Oct 25, 2023
@uablrek uablrek deleted the kube-proxy-init branch October 26, 2023 04:17
@neolit123
Copy link
Member

@jsturtevant @danwinship

kube-proxy on windows needs to be run in privileged mode. either an error or log statement would be helpful here if someone specified this field. As it is now it would be confusing or lead to expectation that it could work as it silently runs.

instead of throwing an error, why wasn't the new flag made as a no-op on Windows with a warning?

ref kubeadm implementation discussion:
kubernetes/kubeadm#2948

@uablrek
Copy link
Contributor Author

uablrek commented Oct 26, 2023

instead of throwing an error, why wasn't the new flag made as a no-op on Windows with a warning?

Good point. It should be sufficient to log a warning and return nil, nil in:

if initOnly {
return nil, fmt.Errorf("--init-only is not implemented on Windows")
}

But I can't test on Windows.

@neolit123 If I create a PR on this, can you help with testing on Windows?

@neolit123
Copy link
Member

neolit123 commented Oct 26, 2023

instead of throwing an error, why wasn't the new flag made as a no-op on Windows with a warning?

Good point. It should be sufficient to log a warning and return nil, nil in:

if initOnly {
return nil, fmt.Errorf("--init-only is not implemented on Windows")
}

But I can't test on Windows.

@neolit123 If I create a PR on this, can you help with testing on Windows?

currently i cannot. but there should be a WIndows reviewer on the new PR regardless.
perhaps @jsturtevant can help with that.

i think the goal should be for a kube-proxy manifest with this new setup to be portable across OSes.

  • security context options on Linux are a no-op on Windows, i assume.
  • on Windows, the init container with the new flag can just show a warning and exit with status 0, but then the main container can be privileged.

does that sound good on paper?

@danwinship
Copy link
Contributor

  • on Windows, the init container with the new flag can just show a warning and exit with status 0, but then the main container can be privileged.

But the whole point is to have the main container not be privileged on Linux.

There needs to be some sort of alternative yaml / template system / whatever if you want to use this feature. On Linux we want a privileged initContainer and an unprivileged regular container, and on Windows we want no initContainer and a privileged regular container. Just making the initContainer work on Windows doesn't help.

@neolit123
Copy link
Member

  • on Windows, the init container with the new flag can just show a warning and exit with status 0, but then the main container can be privileged.

But the whole point is to have the main container not be privileged on Linux.

There needs to be some sort of alternative yaml / template system / whatever if you want to use this feature. On Linux we want a privileged initContainer and an unprivileged regular container, and on Windows we want no initContainer and a privileged regular container. Just making the initContainer work on Windows doesn't help.

i guess that's true.
this forces users and cluster tools to have to deploy two kube-proxy daemonsets - one for Windows and one for Linux nodes. because clusters with mixed OS nodes is a supported use case.

unless there is a way to do some OS specific YAML magic (not templating) and make the --init-only flag work in a single manifest, the feature enablement of --init-only sounds cumbersome.

@jsturtevant
Copy link
Contributor

jsturtevant commented Oct 26, 2023

we've (sig-windows) found that DS running across Windows and Linux sounds like a good idea but in practice is we often need separate DS's due too many minor differences.

The kubernetes project doesn't currently build kube-proxy as a container for Windows, only the binary, so there are additional steps for a Windows user that wants to run kube-proxy in a hostprocess container. Additionally there are some issues with kubeproxy's startup on windows that means you need to customize kube-proxy startup per CNI, I wrote a way to solve this but hasn't been worked on (https://docs.google.com/document/d/1A6Gkyx5EvL86Z4L2P4sxnk9HQjdRgnLXvBg6tIMDB3s/edit#heading=h.rqiiboge3f2e).

In the future, if we get to the point of building the containers with in the project, we may want to revisit this.

@jayunit100
Copy link
Member

Yeah there are also things like process priorities , dsr, topology implementation, configuration of labels for scheduling, .... which all make the windows kube proxy inherently different in terms of how you would want to configure it.

@sftim
Copy link
Contributor

sftim commented Oct 27, 2023

The kubernetes project doesn't currently build kube-proxy as a container for Windows, only the binary, so there are additional steps for a Windows user that wants to run kube-proxy in a hostprocess container.

Is there a feature request issue for that? I don't know if it's feasible to ship a Windows container image (proprietary OS issues?) so it might be a non-starter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/ipvs area/kube-proxy cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document and test minimum privileges to needed run kube-proxy
10 participants