From 7b00673b03385284dc4a174f7b42fc8f62370940 Mon Sep 17 00:00:00 2001
From: Dmitry Sorokin
Date: Thu, 27 Feb 2025 15:28:32 +0000
Subject: [PATCH 1/2] Update docs: add Airflow KubernetesPodOperator

Signed-off-by: Dmitry Sorokin
---
 docs/source/deployment/airflow.md | 46 +++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md
index 7b2652b23d..0186c2812b 100644
--- a/docs/source/deployment/airflow.md
+++ b/docs/source/deployment/airflow.md
@@ -244,6 +244,46 @@ On the next page, set the `Public network (Internet accessible)` option in the `
 ## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster
 
-The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only.
-
-Consult the [GitHub repository for `kedro-airflow-k8s`](https://github.com/getindata/kedro-airflow-k8s) for further details, or take a look at the [documentation](https://kedro-airflow-k8s.readthedocs.io/).
+If you want to execute your DAG in an isolated environment on Airflow using a Kubernetes cluster, you can use a combination of the [`kedro-airflow`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow) and [`kedro-docker`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) plugins.
+
+### Steps to Deploy:
+
+1. **Package Your Kedro Project as a Docker Container**
+   [Use the `kedro docker init` and `kedro docker build` commands](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) to containerize your Kedro project.
+
+2. **Push the Docker Image to a Container Registry**
+   Upload the built Docker image to a cloud container registry, such as AWS ECR, Google Container Registry, or Docker Hub. Example commands for both steps are sketched after this list.
+
+3. **Generate an Airflow DAG**
+   Run the following command to generate an Airflow DAG:
+   ```sh
+   kedro airflow create
+   ```
+   This will create a DAG file that includes the `KedroOperator()` by default.
+
+4. **Modify the DAG to Use `KubernetesPodOperator`**
+   To execute each Kedro node in an isolated Kubernetes pod, replace `KedroOperator()` with `KubernetesPodOperator()`, as shown in the example below:
+
+   ```python
+   from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
+
+   KubernetesPodOperator(
+       task_id=node_name,
+       name=node_name,
+       namespace=NAMESPACE,
+       image=DOCKER_IMAGE,
+       cmds=["kedro"],
+       arguments=["run", f"--nodes={node_name}"],
+       get_logs=True,
+       is_delete_operator_pod=True,  # Cleanup after execution (newer cncf.kubernetes providers deprecate this in favour of on_finish_action="delete_pod")
+       in_cluster=False,
+       do_xcom_push=False,
+       image_pull_policy="Always",
+   )
+   ```
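+
+For reference, steps 1 and 2 from the command line might look something like the sketch below. The registry host and image tag (`registry.example.com/my-kedro-project:latest`) are placeholders, not values generated by the plugins:
+
+```sh
+# Generate a Dockerfile for the project, then build the image (run from the project root)
+pip install kedro-docker
+kedro docker init
+kedro docker build --image registry.example.com/my-kedro-project:latest
+
+# Push the image to a registry your cluster can pull from
+docker push registry.example.com/my-kedro-project:latest
+```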
+
+### Running Multiple Nodes in a Single Container
+By default, this approach runs each node in an isolated Docker container. However, to reduce computational overhead, you can choose to run multiple nodes together within the same container. If you opt for this, you must modify the DAG accordingly to adjust task dependencies and execution order.
+
+### Future Improvements
+In upcoming releases, we plan to integrate an option within the `kedro-airflow` plugin that allows users to choose between `KedroOperator` and `KubernetesPodOperator` without manual modifications. Additionally, we aim to provide an automated way to generate a DAG that groups multiple nodes together using namespaces, reducing the number of container executions.

From 5f186bb8815fa51202ddae700e5d7930e672da8c Mon Sep 17 00:00:00 2001
From: Dmitry Sorokin
Date: Fri, 28 Feb 2025 15:24:42 +0000
Subject: [PATCH 2/2] Address review comments

Signed-off-by: Dmitry Sorokin
---
 docs/source/deployment/airflow.md | 47 +++++++++++++++++++++++++------
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md
index 0186c2812b..53a9903c9b 100644
--- a/docs/source/deployment/airflow.md
+++ b/docs/source/deployment/airflow.md
@@ -246,12 +246,10 @@ On the next page, set the `Public network (Internet accessible)` option in the `
 If you want to execute your DAG in an isolated environment on Airflow using a Kubernetes cluster, you can use a combination of the [`kedro-airflow`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow) and [`kedro-docker`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) plugins.
 
-### Steps to Deploy:
-
-1. **Package Your Kedro Project as a Docker Container**
+1. **Package your Kedro project as a Docker container**
    [Use the `kedro docker init` and `kedro docker build` commands](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) to containerize your Kedro project.
 
-2. **Push the Docker Image to a Container Registry**
+2. **Push the Docker image to a container registry**
    Upload the built Docker image to a cloud container registry, such as AWS ECR, Google Container Registry, or Docker Hub. Example commands for both steps are sketched after this list.
 
 3. **Generate an Airflow DAG**
    Run the following command to generate an Airflow DAG:
    ```sh
    kedro airflow create
    ```
    This will create a DAG file that includes the `KedroOperator()` by default.
 
-4. **Modify the DAG to Use `KubernetesPodOperator`**
+4. **Modify the DAG to use `KubernetesPodOperator`**
    To execute each Kedro node in an isolated Kubernetes pod, replace `KedroOperator()` with `KubernetesPodOperator()`, as shown in the example below:
 
    ```python
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
 
    KubernetesPodOperator(
        task_id=node_name,
        name=node_name,
        namespace=NAMESPACE,
        image=DOCKER_IMAGE,
        cmds=["kedro"],
        arguments=["run", f"--nodes={node_name}"],
        get_logs=True,
        is_delete_operator_pod=True,  # Cleanup after execution (newer cncf.kubernetes providers deprecate this in favour of on_finish_action="delete_pod")
        in_cluster=False,
        do_xcom_push=False,
        image_pull_policy="Always",
    )
    ```
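+
+The snippets above reference `NAMESPACE` and `DOCKER_IMAGE` without defining them, so you need to supply both in the DAG file yourself. One minimal way to do this is sketched below; the environment variable names and default values are hypothetical placeholders to adapt to your cluster and registry:
+
+```python
+import os
+
+# Kubernetes namespace the pods are created in, and the image pushed in step 2
+NAMESPACE = os.environ.get("KEDRO_K8S_NAMESPACE", "default")
+DOCKER_IMAGE = os.environ.get("KEDRO_DOCKER_IMAGE", "registry.example.com/my-kedro-project:latest")
+```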
-### Running Multiple Nodes in a Single Container
+### Running multiple nodes in a single container
+
 By default, this approach runs each node in an isolated Docker container. However, to reduce computational overhead, you can choose to run multiple nodes together within the same container. If you opt for this, you must modify the DAG accordingly to adjust task dependencies and execution order.
 
-### Future Improvements
-In upcoming releases, we plan to integrate an option within the `kedro-airflow` plugin that allows users to choose between `KedroOperator` and `KubernetesPodOperator` without manual modifications. Additionally, we aim to provide an automated way to generate a DAG that groups multiple nodes together using namespaces, reducing the number of container executions.
+For example, in the [`spaceflights-pandas` tutorial](../tutorial/spaceflights_tutorial.md), if you want to execute the first two nodes together, your DAG may look like this:
+
+```python
+from airflow import DAG
+from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
+
+with DAG(...) as dag:
+    tasks = {
+        "preprocess-companies-and-shuttles": KubernetesPodOperator(
+            task_id="preprocess-companies-and-shuttles",
+            name="preprocess-companies-and-shuttles",
+            namespace=NAMESPACE,
+            image=DOCKER_IMAGE,
+            cmds=["kedro"],
+            # Kedro node names are defined with underscores, unlike the dashed task names
+            arguments=["run", "--nodes=preprocess_companies_node,preprocess_shuttles_node"],
+            ...
+        ),
+        "create-model-input-table-node": KubernetesPodOperator(...),
+        ...
+    }
+
+    tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
+    tasks["create-model-input-table-node"] >> tasks["split-data-node"]
+    ...
+```
+
+In this example, we modified the original DAG generated by the `kedro airflow create` command by replacing `KedroOperator()` with `KubernetesPodOperator()`. Additionally, we merged the first two tasks into a single task named `preprocess-companies-and-shuttles`. This task executes the Docker image running two Kedro nodes: `preprocess_companies_node` and `preprocess_shuttles_node`.
+
+Furthermore, we adjusted the task order at the end of the DAG. Instead of having separate dependencies for the first two tasks, we consolidated them into a single line:
+
+```python
+tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
+```
+
+This ensures that the `create-model-input-table-node` task runs only after `preprocess-companies-and-shuttles` has completed.
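+
+As an optional sanity check, you can confirm locally that the merged node selection is valid before deploying the DAG. Run this from inside the Kedro project, assuming the spaceflights starter's node names:
+
+```sh
+kedro run --nodes=preprocess_companies_node,preprocess_shuttles_node
+```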