
kedro-airflow: updating support for Kubernetes #652

Open
DimedS opened this issue Apr 15, 2024 · 4 comments

DimedS (Contributor) commented Apr 15, 2024

Description

To facilitate running Kedro Airflow on Kubernetes, the kedro-airflow-k8s plugin was developed. However, it only supports Kedro versions up to 0.18.0, while the current version is 0.19.4. Consequently, we have moved the recommendation to use this plugin to the end of our Airflow deployment documentation. We now need to determine the best approach for running Kedro Airflow on Kubernetes going forward.

astrojuanlu (Member) commented

@Lasica @marrrcin Any thoughts? Are you accepting PRs on getindata/kedro-airflow-k8s?

marrrcin commented Apr 16, 2024

You can use the official one and run it on k8s. See https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/

DimedS (Contributor, Author) commented May 17, 2024

As I understand it:

If I have a Kubernetes cluster, I can deploy Airflow there using Helm, customise the deployment with a values.yaml file, and provide a custom Docker image to run my Kedro project's DAG. The process involves:

  • Replacing MemoryDatasets with persistent datasets, or grouping nodes.
  • Setting environment variables.
  • Manually copying my DAG to the Airflow scheduler pod (the generated DAG is sketched below), and copying my project's config and packaged files into the Docker build folder.
  • Creating a custom Dockerfile with my Kedro package installation command.
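
For context, the DAG that kedro-airflow generates (the one copied to the scheduler pod above) runs each node in-process through a KedroSession against the installed project package. A simplified sketch of that operator, assuming Kedro 0.19 APIs (not the exact generated code):

```python
# Simplified sketch of the operator in a kedro-airflow generated DAG:
# each Airflow task runs one Kedro node in-process via a KedroSession.
from pathlib import Path

from airflow.models import BaseOperator
from kedro.framework.project import configure_project
from kedro.framework.session import KedroSession


class KedroOperator(BaseOperator):
    def __init__(self, package_name, pipeline_name, node_name, project_path, env, **kwargs):
        super().__init__(**kwargs)
        self.package_name = package_name
        self.pipeline_name = pipeline_name
        self.node_name = node_name
        self.project_path = Path(project_path)
        self.env = env

    def execute(self, context):
        configure_project(self.package_name)  # load the installed Kedro package
        with KedroSession.create(project_path=self.project_path, env=self.env) as session:
            session.run(self.pipeline_name, node_names=[self.node_name])
```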

So technically, I don't need anything special to run Kedro on Airflow deployed on a Kubernetes cluster; it's enough to use a DAG created by the kedro-airflow plugin. However, this setup only allows me to run one Kedro project per Airflow deployment. If I want to run multiple projects in the same Airflow deployment, I can use the KubernetesPodOperator() for each Airflow task (i.e., Kedro node). This will execute each task in an isolated, customised container in a separate Kubernetes Pod, with the KubernetesExecutor dynamically managing all these pods.
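
For illustration, a minimal sketch of that per-node approach, assuming the cncf.kubernetes provider package; the image, namespace, and node names are hypothetical placeholders:

```python
# Hedged sketch: one isolated pod per Kedro node via KubernetesPodOperator.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="my_kedro_project", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    tasks = {
        node_name: KubernetesPodOperator(
            task_id=node_name,
            name=node_name,
            namespace="airflow",                          # hypothetical namespace
            image="my-registry/my-kedro-project:latest",  # hypothetical image
            cmds=["kedro"],
            arguments=["run", f"--nodes={node_name}"],
            get_logs=True,
        )
        for node_name in ["preprocess", "train", "evaluate"]  # hypothetical nodes
    }
    tasks["preprocess"] >> tasks["train"] >> tasks["evaluate"]
```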

However, this approach might be inefficient if there are many Kedro nodes, as it will require deploying many containers. It's better to group nodes to reduce the number of tasks, and thus the number of pods. If I understood correctly, it would be beneficial to add functionality to the kedro-airflow plugin that helps modify the generated DAG by inserting the KubernetesPodOperator() and KubernetesExecutor parts.
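
With grouping, the change is small: one pod executes the whole group in a single `kedro run` call, so intermediate MemoryDatasets survive within the group. A sketch, with hypothetical group and node names:

```python
# Hedged sketch: N:M grouping - several Kedro nodes share one pod.
KubernetesPodOperator(
    task_id="feature_engineering",  # hypothetical group name
    name="feature-engineering",
    namespace="airflow",
    image="my-registry/my-kedro-project:latest",
    cmds=["kedro"],
    # one container run executes all three nodes, so intermediate
    # MemoryDatasets never need to be persisted between them
    arguments=["run", "--nodes=clean,impute,encode"],  # hypothetical node names
    get_logs=True,
)
```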

Do you have the same opinion, @marrrcin? Is using the KubernetesPodOperator() for each task a good solution?

marrrcin commented Jun 5, 2024

Hi,
the solution I linked above (https://getindata.com/blog/deploying-kedro-pipelines-gcp-composer-airflow-node-grouping-mlflow/) does exactly that: it either runs N:N <Kedro nodes>:<pod for each node> or, with grouping, N:M <Kedro nodes>:<pod for each group>.
It also allows you to use the same Airflow deployment and run multiple Kedro projects within the same instance with full isolation. IMHO that's the best approach here. I would say that the default template should encourage using KubernetesPodOperator.
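
To illustrate the isolation point: because each task only needs an image and a command, several Kedro projects can share one Airflow deployment simply by pointing their DAGs at different images. A sketch with hypothetical image and pipeline names:

```python
# Hedged sketch: two Kedro projects in one Airflow deployment,
# isolated via separate images (each defined in its own DAG file).
KubernetesPodOperator(
    task_id="project_a_training",
    name="project-a-training",
    namespace="airflow",
    image="my-registry/project-a:1.2.0",  # ships only project A's dependencies
    cmds=["kedro"],
    arguments=["run", "--pipeline=training"],
)

KubernetesPodOperator(
    task_id="project_b_training",
    name="project-b-training",
    namespace="airflow",
    image="my-registry/project-b:0.4.1",  # ships only project B's dependencies
    cmds=["kedro"],
    arguments=["run", "--pipeline=training"],
)
```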
