
Would like ClusterCIDR to be fetchable by pods. #46508

Closed
bboreham opened this issue May 26, 2017 · 59 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@bboreham
Contributor

FEATURE REQUEST

Currently, kube-proxy and kube-controller-manager each have a command-line parameter --cluster-cidr, described as "CIDR range of pods in the cluster". There is no API to fetch this parameter.

I work on a network add-on, Weave Net, which would benefit from being able to see this value. Currently we have our own setting for the range of IP addresses to give to pods, and it is frustrating to receive issue reports caused by the values being out of sync.

Could we have ClusterCIDR as a top-level value on the cluster, which is then read by kube-proxy and kube-controller-manager and anyone else who wants it?

And/or supply it via the downward API.
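
To make the request concrete, here is a purely hypothetical sketch (in Go, in the style of the ComponentConfig types quoted later in this thread) of the kind of cluster-scoped object being asked for; the API group, type name, and fields are invented for illustration and do not exist in Kubernetes.

// Hypothetical sketch only: a cluster-scoped object that kube-controller-manager,
// kube-proxy, and network add-ons could all read, so the pod CIDR is configured
// exactly once. Nothing like this exists in the Kubernetes API; the names below
// are invented for this sketch.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type ClusterNetworkConfig struct {
	metav1.TypeMeta
	metav1.ObjectMeta

	// ClusterCIDR is the CIDR range from which pod IPs are allocated; the single
	// source of truth that the --cluster-cidr flags would be derived from.
	ClusterCIDR string
	// ServiceCIDR is the CIDR range from which service ClusterIPs are allocated.
	ServiceCIDR string
}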

@bboreham
Contributor Author

/sig network

@k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label May 26, 2017
@luxas
Member

luxas commented May 27, 2017

Related I guess: kubernetes/community#662

@MikeSpreitzer
Member

I work on BYOIP and explicit networks, so I am no fan of having a configured --cluster-cidr; I would rather see its usage decrease (eventually to zero) than increase. What would it take for Weave Net to flourish in such a setting?

@bboreham
Contributor Author

bboreham commented Jun 1, 2017

@MikeSpreitzer it flourishes pretty well 🙂 But there are three independent settings at present, so let me take them one by one:

Weave Net has a CIDR: it allocates IP addresses to pods within that range, and it creates a route for that CIDR inside each pod (the whole Weave network is one hop at L3). As far as I can think, BYOIP would replace the first but not the second - we need to know how to route to other pods.

kube-proxy has a --cluster-cidr option. If it is not set, two things happen that occasionally impact users:

  1. kube-proxy issues a log message*. Often, people who have a networking problem find this message, conclude this is the reason for their problem, and add the flag. If they do not set it to a value which matches Weave Net's range, things get worse.
  2. kube-proxy causes SNAT on all traffic from a pod implementing a service to a service VIP. This in turn makes it impossible (within the Weave Net design) to implement a NetworkPolicy specific to that source pod.

kube-controller-manager also has a --cluster-cidr argument; we do not care about this one. However, it is commonly used in conjunction with --allocate-node-cidrs, which sets up routes that can cause random network disruption if you are using an overlay network.

*: "clusterCIDR not specified, unable to distinguish between internal and external traffic"

@johnbelamaric
Member

See related #25533

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2017
@fasaxc
Contributor

fasaxc commented Jan 2, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2018
@fasaxc
Contributor

fasaxc commented Jan 2, 2018

Still an issue (and affects other networking plugins, such as Calico).

@errordeveloper
Member

@fasaxc I'd like to help with this. Do you have any thoughts on how this could be solved?

@errordeveloper
Member

errordeveloper commented Feb 13, 2018

@thockin said in #25533 (comment):

this is one of the things I want in the per-cluster configmap

I believe he is referring to ComponentConfig. There has been significant progress on this, and kube-proxy and kube-controller-manager have both implemented ComponentConfig:

type KubeControllerManagerConfiguration struct {
metav1.TypeMeta
// Controllers is the list of controllers to enable or disable
// '*' means "all enabled by default controllers"
// 'foo' means "enable 'foo'"
// '-foo' means "disable 'foo'"
// first item for a particular name wins
Controllers []string
// port is the port that the controller-manager's http service runs on.
Port int32
// address is the IP address to serve on (set to 0.0.0.0 for all interfaces).
Address string
// useServiceAccountCredentials indicates whether controllers should be run with
// individual service account credentials.
UseServiceAccountCredentials bool
// cloudProvider is the provider for cloud services.
CloudProvider string
// cloudConfigFile is the path to the cloud provider configuration file.
CloudConfigFile string
// externalCloudVolumePlugin specifies the plugin to use when cloudProvider is "external".
// It is currently used by the in repo cloud providers to handle node and volume control in the KCM.
ExternalCloudVolumePlugin string
// run with untagged cloud instances
AllowUntaggedCloud bool
// concurrentEndpointSyncs is the number of endpoint syncing operations
// that will be done concurrently. Larger number = faster endpoint updating,
// but more CPU (and network) load.
ConcurrentEndpointSyncs int32
// concurrentRSSyncs is the number of replica sets that are allowed to sync
// concurrently. Larger number = more responsive replica management, but more
// CPU (and network) load.
ConcurrentRSSyncs int32
// concurrentRCSyncs is the number of replication controllers that are
// allowed to sync concurrently. Larger number = more responsive replica
// management, but more CPU (and network) load.
ConcurrentRCSyncs int32
// concurrentServiceSyncs is the number of services that are
// allowed to sync concurrently. Larger number = more responsive service
// management, but more CPU (and network) load.
ConcurrentServiceSyncs int32
// concurrentResourceQuotaSyncs is the number of resource quotas that are
// allowed to sync concurrently. Larger number = more responsive quota
// management, but more CPU (and network) load.
ConcurrentResourceQuotaSyncs int32
// concurrentDeploymentSyncs is the number of deployment objects that are
// allowed to sync concurrently. Larger number = more responsive deployments,
// but more CPU (and network) load.
ConcurrentDeploymentSyncs int32
// concurrentDaemonSetSyncs is the number of daemonset objects that are
// allowed to sync concurrently. Larger number = more responsive daemonset,
// but more CPU (and network) load.
ConcurrentDaemonSetSyncs int32
// concurrentJobSyncs is the number of job objects that are
// allowed to sync concurrently. Larger number = more responsive jobs,
// but more CPU (and network) load.
ConcurrentJobSyncs int32
// concurrentNamespaceSyncs is the number of namespace objects that are
// allowed to sync concurrently.
ConcurrentNamespaceSyncs int32
// concurrentSATokenSyncs is the number of service account token syncing operations
// that will be done concurrently.
ConcurrentSATokenSyncs int32
// lookupCacheSizeForRC is the size of lookup cache for replication controllers.
// Larger number = more responsive replica management, but more MEM load.
// nodeSyncPeriod is the period for syncing nodes from cloudprovider. Longer
// periods will result in fewer calls to cloud provider, but may delay addition
// of new nodes to cluster.
NodeSyncPeriod metav1.Duration
// routeReconciliationPeriod is the period for reconciling routes created for Nodes by cloud provider..
RouteReconciliationPeriod metav1.Duration
// resourceQuotaSyncPeriod is the period for syncing quota usage status
// in the system.
ResourceQuotaSyncPeriod metav1.Duration
// namespaceSyncPeriod is the period for syncing namespace life-cycle
// updates.
NamespaceSyncPeriod metav1.Duration
// pvClaimBinderSyncPeriod is the period for syncing persistent volumes
// and persistent volume claims.
PVClaimBinderSyncPeriod metav1.Duration
// minResyncPeriod is the resync period in reflectors; will be random between
// minResyncPeriod and 2*minResyncPeriod.
MinResyncPeriod metav1.Duration
// terminatedPodGCThreshold is the number of terminated pods that can exist
// before the terminated pod garbage collector starts deleting terminated pods.
// If <= 0, the terminated pod garbage collector is disabled.
TerminatedPodGCThreshold int32
// horizontalPodAutoscalerSyncPeriod is the period for syncing the number of
// pods in horizontal pod autoscaler.
HorizontalPodAutoscalerSyncPeriod metav1.Duration
// horizontalPodAutoscalerUpscaleForbiddenWindow is a period after which next upscale allowed.
HorizontalPodAutoscalerUpscaleForbiddenWindow metav1.Duration
// horizontalPodAutoscalerDownscaleForbiddenWindow is a period after which next downscale allowed.
HorizontalPodAutoscalerDownscaleForbiddenWindow metav1.Duration
// horizontalPodAutoscalerTolerance is the tolerance for when
// resource usage suggests upscaling/downscaling
HorizontalPodAutoscalerTolerance float64
// deploymentControllerSyncPeriod is the period for syncing the deployments.
DeploymentControllerSyncPeriod metav1.Duration
// podEvictionTimeout is the grace period for deleting pods on failed nodes.
PodEvictionTimeout metav1.Duration
// DEPRECATED: deletingPodsQps is the number of nodes per second on which pods are deleted in
// case of node failure.
DeletingPodsQps float32
// DEPRECATED: deletingPodsBurst is the number of nodes on which pods are bursty deleted in
// case of node failure. For more details look into RateLimiter.
DeletingPodsBurst int32
// nodeMontiorGracePeriod is the amount of time which we allow a running node to be
// unresponsive before marking it unhealthy. Must be N times more than kubelet's
// nodeStatusUpdateFrequency, where N means number of retries allowed for kubelet
// to post node status.
NodeMonitorGracePeriod metav1.Duration
// registerRetryCount is the number of retries for initial node registration.
// Retry interval equals node-sync-period.
RegisterRetryCount int32
// nodeStartupGracePeriod is the amount of time which we allow starting a node to
// be unresponsive before marking it unhealthy.
NodeStartupGracePeriod metav1.Duration
// nodeMonitorPeriod is the period for syncing NodeStatus in NodeController.
NodeMonitorPeriod metav1.Duration
// serviceAccountKeyFile is the filename containing a PEM-encoded private RSA key
// used to sign service account tokens.
ServiceAccountKeyFile string
// clusterSigningCertFile is the filename containing a PEM-encoded
// X509 CA certificate used to issue cluster-scoped certificates
ClusterSigningCertFile string
// clusterSigningCertFile is the filename containing a PEM-encoded
// RSA or ECDSA private key used to issue cluster-scoped certificates
ClusterSigningKeyFile string
// clusterSigningDuration is the length of duration signed certificates
// will be given.
ClusterSigningDuration metav1.Duration
// enableProfiling enables profiling via web interface host:port/debug/pprof/
EnableProfiling bool
// enableContentionProfiling enables lock contention profiling, if enableProfiling is true.
EnableContentionProfiling bool
// clusterName is the instance prefix for the cluster.
ClusterName string
// clusterCIDR is CIDR Range for Pods in cluster.
ClusterCIDR string
// serviceCIDR is CIDR Range for Services in cluster.
ServiceCIDR string
// NodeCIDRMaskSize is the mask size for node cidr in cluster.
NodeCIDRMaskSize int32
// AllocateNodeCIDRs enables CIDRs for Pods to be allocated and, if
// ConfigureCloudRoutes is true, to be set on the cloud provider.
AllocateNodeCIDRs bool
// CIDRAllocatorType determines what kind of pod CIDR allocator will be used.
CIDRAllocatorType string
// configureCloudRoutes enables CIDRs allocated with allocateNodeCIDRs
// to be configured on the cloud provider.
ConfigureCloudRoutes bool
// rootCAFile is the root certificate authority will be included in service
// account's token secret. This must be a valid PEM-encoded CA bundle.
RootCAFile string
// contentType is contentType of requests sent to apiserver.
ContentType string
// kubeAPIQPS is the QPS to use while talking with kubernetes apiserver.
KubeAPIQPS float32
// kubeAPIBurst is the burst to use while talking with kubernetes apiserver.
KubeAPIBurst int32
// leaderElection defines the configuration of leader election client.
LeaderElection LeaderElectionConfiguration
// volumeConfiguration holds configuration for volume related features.
VolumeConfiguration VolumeConfiguration
// How long to wait between starting controller managers
ControllerStartInterval metav1.Duration
// enables the generic garbage collector. MUST be synced with the
// corresponding flag of the kube-apiserver. WARNING: the generic garbage
// collector is an alpha feature.
EnableGarbageCollector bool
// concurrentGCSyncs is the number of garbage collector workers that are
// allowed to sync concurrently.
ConcurrentGCSyncs int32
// gcIgnoredResources is the list of GroupResources that garbage collection should ignore.
GCIgnoredResources []GroupResource
// nodeEvictionRate is the number of nodes per second on which pods are deleted in case of node failure when a zone is healthy
NodeEvictionRate float32
// secondaryNodeEvictionRate is the number of nodes per second on which pods are deleted in case of node failure when a zone is unhealthy
SecondaryNodeEvictionRate float32
// secondaryNodeEvictionRate is implicitly overridden to 0 for clusters smaller than or equal to largeClusterSizeThreshold
LargeClusterSizeThreshold int32
// Zone is treated as unhealthy in nodeEvictionRate and secondaryNodeEvictionRate when at least
// unhealthyZoneThreshold (no less than 3) of Nodes in the zone are NotReady
UnhealthyZoneThreshold float32
// Reconciler runs a periodic loop to reconcile the desired state of the with
// the actual state of the world by triggering attach detach operations.
// This flag enables or disables reconcile. Is false by default, and thus enabled.
DisableAttachDetachReconcilerSync bool
// ReconcilerSyncLoopPeriod is the amount of time the reconciler sync states loop
// wait between successive executions. Is set to 5 sec by default.
ReconcilerSyncLoopPeriod metav1.Duration
// If set to true enables NoExecute Taints and will evict all not-tolerating
// Pod running on Nodes tainted with this kind of Taints.
EnableTaintManager bool
// HorizontalPodAutoscalerUseRESTClients causes the HPA controller to use REST clients
// through the kube-aggregator when enabled, instead of using the legacy metrics client
// through the API server proxy.
HorizontalPodAutoscalerUseRESTClients bool
}

type KubeProxyConfiguration struct {
metav1.TypeMeta
// TODO FeatureGates really should be a map but that requires refactoring all
// components to use config files because local-up-cluster.sh only supports
// the --feature-gates flag right now, which is comma-separated key=value
// pairs.
//
// featureGates is a comma-separated list of key=value pairs that control
// which alpha/beta features are enabled.
FeatureGates string
// bindAddress is the IP address for the proxy server to serve on (set to 0.0.0.0
// for all interfaces)
BindAddress string
// healthzBindAddress is the IP address and port for the health check server to serve on,
// defaulting to 0.0.0.0:10256
HealthzBindAddress string
// metricsBindAddress is the IP address and port for the metrics server to serve on,
// defaulting to 127.0.0.1:10249 (set to 0.0.0.0 for all interfaces)
MetricsBindAddress string
// enableProfiling enables profiling via web interface on /debug/pprof handler.
// Profiling handlers will be handled by metrics server.
EnableProfiling bool
// clusterCIDR is the CIDR range of the pods in the cluster. It is used to
// bridge traffic coming from outside of the cluster. If not provided,
// no off-cluster bridging will be performed.
ClusterCIDR string
// hostnameOverride, if non-empty, will be used as the identity instead of the actual hostname.
HostnameOverride string
// clientConnection specifies the kubeconfig file and client connection settings for the proxy
// server to use when communicating with the apiserver.
ClientConnection ClientConnectionConfiguration
// iptables contains iptables-related configuration options.
IPTables KubeProxyIPTablesConfiguration
// ipvs contains ipvs-related configuration options.
IPVS KubeProxyIPVSConfiguration
// oomScoreAdj is the oom-score-adj value for kube-proxy process. Values must be within
// the range [-1000, 1000]
OOMScoreAdj *int32
// mode specifies which proxy mode to use.
Mode ProxyMode
// portRange is the range of host ports (beginPort-endPort, inclusive) that may be consumed
// in order to proxy service traffic. If unspecified (0-0) then ports will be randomly chosen.
PortRange string
// resourceContainer is the absolute name of the resource-only container to create and run
// the Kube-proxy in (Default: /kube-proxy).
ResourceContainer string
// udpIdleTimeout is how long an idle UDP connection will be kept open (e.g. '250ms', '2s').
// Must be greater than 0. Only applicable for proxyMode=userspace.
UDPIdleTimeout metav1.Duration
// conntrack contains conntrack-related configuration options.
Conntrack KubeProxyConntrackConfiguration
// configSyncPeriod is how often configuration from the apiserver is refreshed. Must be greater
// than 0.
ConfigSyncPeriod metav1.Duration
}

More specifically:

// clusterCIDR is CIDR Range for Pods in cluster.
ClusterCIDR string

// clusterCIDR is the CIDR range of the pods in the cluster. It is used to
// bridge traffic coming from outside of the cluster. If not provided,
// no off-cluster bridging will be performed.
ClusterCIDR string

I believe it should already be possible to read these using the API, but I think it's alpha and needs to be enabled explicitly.
In the case of kubeadm, there is a kubeadm-config ConfigMap, but the default value is an empty string, so we'd have to deal with that, and it's very kubeadm-specific.
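
As a concrete illustration of the kubeadm-specific workaround mentioned above: a minimal sketch, assuming a recent client-go, in-cluster credentials with RBAC to read ConfigMaps in kube-system, and the ConfigMap name ("kube-proxy") and data key ("config.conf") that kubeadm happens to use for kube-proxy's ComponentConfig; none of this is a stable API.

// Minimal sketch (not a stable API): pull clusterCIDR out of the ComponentConfig
// that kubeadm stores for kube-proxy. The ConfigMap name and data key are kubeadm
// conventions and may change.
package main

import (
	"context"
	"fmt"
	"regexp"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func clusterCIDRFromKubeProxyConfig(ctx context.Context) (string, error) {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return "", err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return "", err
	}
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "kube-proxy", metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	// Crude extraction; a real implementation would unmarshal the
	// KubeProxyConfiguration YAML rather than pattern-match it.
	m := regexp.MustCompile(`(?m)^clusterCIDR:\s*"?([^"\s]+)"?`).FindStringSubmatch(cm.Data["config.conf"])
	if len(m) < 2 {
		return "", fmt.Errorf("clusterCIDR not set in kube-proxy ConfigMap")
	}
	return m[1], nil
}

func main() {
	cidr, err := clusterCIDRFromKubeProxyConfig(context.Background())
	if err != nil {
		fmt.Println("could not determine cluster CIDR:", err)
		return
	}
	fmt.Println("clusterCIDR:", cidr)
}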

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 14, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 13, 2018
@bboreham
Contributor Author

/remove-lifecycle rotten

this is still an issue for users of Weave Net.

@k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 14, 2018
@caseydavenport
Member

The ComponentConfig work looks promising and is probably enough to close this issue once it's beta/GA, though it's still somewhat awkward that this value would need to be configured in two locations (once for the controller manager and once for kube-proxy).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 12, 2018
@bboreham
Contributor Author

/remove-lifecycle stale

@thockin self-assigned this Oct 15, 2020
@johnbelamaric
Member

Yes, that's the real ask (a proper IPAM API).

@errordeveloper
Member

errordeveloper commented Oct 16, 2020

I think it would be good to also ask whether this should be part of a broader network configuration API.

Aside from being able to find out which CIDRs are being used for pods and services, I think it would be very useful to be able to tell at least the name of the network in a given cluster. Beyond that, it would also be nice to tell whether, e.g., there are pods in a certain namespace that are responsible for the network. Happy to elaborate on use cases if there is no general objection to this. It's something I've been thinking about for a while, and it seems reasonable to bring up for discussion if we are to add some APIs.

@thockin
Member

thockin commented Oct 16, 2020 via email

@bboreham
Contributor Author

There is a KEP "Removing Knowledge of pod cluster CIDR from iptables rules", implemented by #87748 et seq.
I think it can address part of this issue, in that kube-proxy can now do its job without the CIDR.
(Feels like it strengthens the need for "a broader network configuration API" though)

@aojea
Member

aojea commented Oct 16, 2020

xref: Node podCIDR should be EOL'ed
#57130
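
For contrast with what this issue asks for: the per-node pod CIDR is already exposed on the Node object when --allocate-node-cidrs is in use. A minimal sketch, assuming a recent client-go, in-cluster credentials with permission to read Nodes, and a NODE_NAME environment variable injected via the downward API (the variable name is this sketch's choice, not a Kubernetes convention):

package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Read this node's Node object; spec.podCIDR / spec.podCIDRs are per-node
	// allocations, not the cluster-wide value this issue asks for.
	node, err := client.CoreV1().Nodes().Get(context.Background(), os.Getenv("NODE_NAME"), metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("podCIDR:", node.Spec.PodCIDR)
	fmt.Println("podCIDRs:", node.Spec.PodCIDRs) // dual-stack aware field on newer clusters
}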

@thockin
Member

thockin commented Oct 17, 2020 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 15, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 14, 2021
@caseydavenport
Member

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 10, 2021
@bridgetkromhout
Member

Discussed at sig-network meeting on March 18, 2021.

@errordeveloper
Member

@bridgetkromhout I can see from the notes that it was discussed, but little info was logged... any chance someone could summarise the SIG's verdict, if something was indeed decided?

@thockin
Member

thockin commented Mar 19, 2021 via email

@bboreham
Contributor Author

An optional API would satisfy my original request, so long as, where available, it matches what kube-proxy and/or kube-controller-manager are using.

In my use-case Kubernetes shouldn't be doing IPAM.

@jayunit100
Member

jayunit100 commented Mar 26, 2021

/area kube-proxy (just testin)

@jayunit100
Member

/remove-area kube-proxy

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2021
@thockin
Member

thockin commented Jul 8, 2021

OK. Let's consider kubernetes/enhancements#2593 the PoR fix to this.
