
Simplifying inter-pod-communication between Job Pods #99497

Closed
alculquicondor opened this issue Feb 26, 2021 · 10 comments
Labels
area/batch kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps.

Comments

@alculquicondor
Member

In #97169, kubernetes/enhancements#2214 we are adding Indexed completion mode, in which each pod of a job gets an index as an annotation. This is useful for static work partitioning in parallel jobs.
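For reference, a minimal sketch of what the baseline Indexed mode looks like from the user's side (field and annotation names are the ones proposed in the KEP, e.g. the controller stamping each Pod with a batch.kubernetes.io/job-completion-index annotation, and may change):

```yaml
# Sketch of a Job using the proposed Indexed completion mode.
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-indexed
spec:
  completions: 4
  parallelism: 4
  completionMode: Indexed        # each Pod gets a completion index 0..3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo processing my partition"]
```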

We see a potential expansion of the Indexed completion mode (also mentioned in the KEP alternatives) in which it becomes easier to reference each Pod from outside or inside the Job. This can be accomplished in two ways:

  • Adding the index also as a label. This would allow adding Services on top of the Job that reference each index. This could be added in the same "Indexed" completion mode.
  • Having the indexes as part of the pod name. This removes the need for services to reference indexes. This would have to be a new completion mode (like IndexedAndUnique) because it's not backwards compatible. Some downsides:
    • It introduces recreation delays, as a Pod for an index can only be created once the previous instance has been deleted.
    • An extreme case of the above is when a Node completely fails. Today, Kubernetes can take a while to react to a missing Node before removing it from the apiserver and, consequently, removing the orphaned Pod.
    • It can only be done once "job tracking with finalizers" (Job controller should not rely on calculating pod for its status #28486) graduates to beta, so that failures can be properly accounted for.

I'm leaving this issue open for discussion and feedback from sig-apps members and ecosystem developers of HPC frameworks on k8s.

/sig apps
/area batch

@alculquicondor alculquicondor added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 26, 2021
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. area/batch needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 26, 2021
@k8s-ci-robot
Contributor

@alculquicondor: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ahg-g
Member

ahg-g commented Feb 26, 2021

+1 to having a mode for stable DNS pod names.

There are two cases for stable DNS names:

  1. MPI jobs (see mpi-operator)
  2. Distributed Tensorflow training jobs (see tf-operator)

While MPI can likely use StatefulSet, tf-operator could be significantly simplified if v1.Job supported indexed and stable DNS names.

A TFJob includes at least two sets of jobs: workers and parameter servers (PS). Each worker and PS has a unique index (worker-0, worker-1, ..., worker-n; similarly ps-0, ps-1, ..., ps-m). A worker needs to communicate with a specific PS server, and a PS server needs to communicate with specific workers (see the following diagram: http://sharkdtu.com/images/tf-ps-worker.png).

tf-operator currently implements this by creating a Service resource fronting every single worker or PS pod that gets created in a TFJob. The reason it doesn't use StatefulSet is that pods are not allowed to complete (i.e., the only restart policy allowed is Always); while the worker or PS pod could stay idle when it finishes, this is not ideal because it would keep hold of the resources allocated to it, which is expensive, especially when a GPU is allocated to the pod. Creating pods and fronting each with a Service makes it possible to give pods stable DNS names while allowing them to use the OnFailure or Never restart policy. Hence tf-operator opted for pods with manual Service setup rather than a StatefulSet.
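Roughly, that per-Pod Service pattern looks like the following sketch (the names and selector labels are hypothetical, for illustration only):

```yaml
# Illustrative only: the kind of Service created for every single replica,
# here fronting worker 0 of a TFJob named "tfjob-sample".
apiVersion: v1
kind: Service
metadata:
  name: tfjob-sample-worker-0
spec:
  selector:
    tf-job-name: tfjob-sample       # hypothetical per-replica labels
    tf-replica-type: worker
    tf-replica-index: "0"
  ports:
  - port: 2222
    targetPort: 2222
```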

The problem with Services is twofold:

  1. They add extra latency: traffic has to go through iptables rules so that the Service VIP gets replaced with the actual IP of the pod for every packet sent or received.
  2. Scale problems: each Service creates an associated Endpoints resource; moreover, Services require programming on each node, generating lots of control traffic.

An Indexed Job with stable DNS names brings the following advantages to TF distributed training:

  1. The tf-operator will not need to manually create and manage pods and services. All pod management will be offloaded to the core Job controller. This not only simplifies the tf-operator, but also takes us one step further towards unifying the fragmented Kubernetes batch ecosystem around the core k8s Job API.

  2. Compared to the Service-based solution, an Indexed Job with stable DNS names offers lower latency and scales better: pods can address each other directly, with no Service in between (no extra resources created and no iptables programming), just a DNS lookup.

@alculquicondor
Member Author

cc @soltysh

@alculquicondor
Member Author

alculquicondor commented Apr 14, 2021

Proposal (brainstormed with @ahg-g):
Add some features to the Indexed completion mode. In addition to adding the completion index as a Pod annotation, the job controller would:

  • Set the hostname to $(job-name)-$(index). If the user adds a Service that matches the Job selector and sets the Pods' subdomain to the service name, then the control plane creates DNS records for the hostname. A pod in the job could refer to another pod with the domain $(job-name)-$(index).$(subdomain), as sketched below.
  • (nice to have) Prefix the pod names with $(job-name)-$(index)- so that users can quickly see which Pods correspond to a specific index (without having to look at the annotation). Alternatively, the index can be added as a label, so that it can be used in a label selector.
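A rough sketch of how a user would wire this up (the headless Service, the names, and the job-name selector label are illustrative; the hostname would be set by the controller):

```yaml
# Headless Service: creates per-Pod DNS records, no VIP, no iptables programming.
apiVersion: v1
kind: Service
metadata:
  name: sample-job-svc
spec:
  clusterIP: None
  selector:
    job-name: sample-job           # label the Job controller puts on its Pods
---
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
spec:
  completions: 3
  parallelism: 3
  completionMode: Indexed
  template:
    spec:
      subdomain: sample-job-svc    # must match the headless Service name
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        # with hostname $(job-name)-$(index), a peer is reachable at
        # sample-job-0.sample-job-svc
        command: ["sh", "-c", "sleep 30"]
```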

These changes are backwards compatible with the existing Indexed Job, so they can be added to such completion mode as part of the beta graduation. I'm bringing this up for the next SIG meeting.

@johnbelamaric do you have any thoughts from a DNS perspective?

@johnbelamaric
Member

That should work. My only concerns would be 1) the scale; and 2) the DNS programming latency. For very large jobs with short-lived pods, this would create a fair bit of churn and API server traffic on the endpointslice watches to CoreDNS (note that when using a headless service, there are no iptables rules programmed, IIRC, so that shouldn't be an issue). If the pods want to inter-communicate, they must also be prepared to retry the lookup in the event that DNS has yet to be programmed when they come up (particularly if you are bringing up 1000s of pods at a time).
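One way to do that retry, as a minimal sketch (hypothetical names, reusing those from the earlier example): an init container that blocks until the peer's DNS record resolves.

```yaml
# Pod template fragment, illustrative only: wait for a peer's DNS record
# before starting the main container.
initContainers:
- name: wait-for-peer-dns
  image: busybox
  command:
  - sh
  - -c
  - |
    until nslookup sample-job-0.sample-job-svc; do
      echo "waiting for DNS record"; sleep 2
    done
```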

Another option would be to have some new functionality in CoreDNS (or other DNS provider) that watches jobs and their associated pods and serves up DNS for those. In that case:

  1. There is no added churn in the API server, since the EndpointSlice controller no longer needs to create those objects.
  2. CoreDNS would still have to watch the Job pods, so it wouldn't make a difference from the CoreDNS scale point of view.
  3. DNS programming latency would be a bit lower (it skips the need for EndpointSlice creation).

@ahg-g
Member

ahg-g commented Apr 14, 2021

Thanks John, we could start with relying on a headless service, and then explore the option of having a watch on the jobs in the dns service. I think the implementation from the job controller perspective will not change: setting the hostname and subdomain on the job pods.

@alculquicondor
Member Author

Thanks @johnbelamaric! What you mention is in line with what we thought. On the flip side, the network programming is opt-in, as the user has to create the Service and match the Pods' subdomain. We also expect that tightly coupled parallel Jobs wouldn't be very short-lived.

We will call out lookup retries in the KEP and documentation. The only solution that wouldn't require DNS lookups is one that queries the apiserver to obtain the Pod IPs (which users could do using the indexes, if they wish to do so).

Having the direct support in CoreDNS sounds promising. Glad to know you are open to the idea. We can certainly explore that later.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 13, 2021
@alculquicondor
Member Author

We solved this with Indexed Jobs kubernetes/enhancements#2630

/close

@k8s-ci-robot
Contributor

@alculquicondor: Closing this issue.

In response to this:

We solved this with Indexed Jobs kubernetes/enhancements#2630

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
