diff --git a/docs/source/deployment/airflow.md b/docs/source/deployment/airflow.md
index 7b2652b23d..53a9903c9b 100644
--- a/docs/source/deployment/airflow.md
+++ b/docs/source/deployment/airflow.md
@@ -244,6 +244,77 @@ On the next page, set the `Public network (Internet accessible)` option in the `
 ## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster
 
-The `kedro-airflow-k8s` plugin from GetInData | Part of Xebia enables you to run a Kedro pipeline on Airflow with a Kubernetes cluster. The plugin can be used together with `kedro-docker` to prepare a docker image for pipeline execution. At present, the plugin is available for versions of Kedro < 0.18 only.
+If you want to execute your DAG in an isolated environment on Airflow using a Kubernetes cluster, you can use a combination of the [`kedro-airflow`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow) and [`kedro-docker`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) plugins.
+
+1. **Package your Kedro project as a Docker container**
+   [Use the `kedro docker init` and `kedro docker build` commands](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) to containerize your Kedro project.
+
+2. **Push the Docker image to a container registry**
+   Upload the built Docker image to a cloud container registry, such as AWS ECR, Google Container Registry, or Docker Hub. A sketch of the commands for steps 1 and 2 follows this list.
+
+3. **Generate an Airflow DAG**
+   Run the following command to generate an Airflow DAG:
+   ```sh
+   kedro airflow create
+   ```
+   This creates a DAG file that uses `KedroOperator()` by default.
+
+4. **Modify the DAG to use `KubernetesPodOperator`**
+   To execute each Kedro node in an isolated Kubernetes pod, replace `KedroOperator()` with `KubernetesPodOperator()`, as shown in the example below:
+
+   ```python
+   from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
+
+   KubernetesPodOperator(
+       task_id=node_name,
+       name=node_name,
+       namespace=NAMESPACE,
+       image=DOCKER_IMAGE,
+       cmds=["kedro"],
+       arguments=["run", f"--nodes={node_name}"],  # Each pod runs a single Kedro node
+       get_logs=True,
+       is_delete_operator_pod=True,  # Clean up the pod after execution
+       in_cluster=False,  # Set to True if Airflow itself runs inside the cluster
+       do_xcom_push=False,  # Kedro nodes exchange data via the catalog, not XCom
+       image_pull_policy="Always",
+   )
+   ```
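+
+As a quick reference for steps 1 and 2, the commands might look like the sketch below. The image name `my-kedro-project` and the registry `registry.example.com` are placeholders; substitute your own values, and note that the exact flags may vary with your `kedro-docker` version:
+
+```sh
+# Step 1: generate a Dockerfile and .dockerignore, then build the image
+pip install kedro-docker
+kedro docker init
+kedro docker build
+
+# Step 2: tag the image and push it to your container registry
+docker tag my-kedro-project registry.example.com/my-kedro-project:latest
+docker push registry.example.com/my-kedro-project:latest
+```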
+
+### Running multiple nodes in a single container
+
+By default, this approach runs each node in an isolated Docker container. However, to reduce the overhead of launching a separate pod for every node, you can run multiple nodes together within the same container. If you opt for this, you must also modify the DAG to adjust the task dependencies and execution order accordingly.
+
+For example, in the [`spaceflights-pandas` tutorial](../tutorial/spaceflights_tutorial.md), if you want to execute the first two nodes together, your DAG may look like this:
+
+```python
+from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
+
+with DAG(...) as dag:
+    tasks = {
+        "preprocess-companies-and-shuttles": KubernetesPodOperator(
+            task_id="preprocess-companies-and-shuttles",
+            name="preprocess-companies-and-shuttles",
+            namespace=NAMESPACE,
+            image=DOCKER_IMAGE,
+            cmds=["kedro"],
+            arguments=["run", "--nodes=preprocess-companies-node,preprocess-shuttles-node"],
+            ...
+        ),
+        "create-model-input-table-node": KubernetesPodOperator(...),
+        ...
+    }
+
+    tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
+    tasks["create-model-input-table-node"] >> tasks["split-data-node"]
+    ...
+```
+
+In this example, we modified the original DAG generated by the `kedro airflow create` command by replacing `KedroOperator()` with `KubernetesPodOperator()`. Additionally, we merged the first two tasks into a single task named `preprocess-companies-and-shuttles`. This task runs the Docker image and executes two Kedro nodes: `preprocess-companies-node` and `preprocess-shuttles-node`.
+
+Furthermore, we adjusted the task dependencies at the end of the DAG. Instead of declaring separate dependencies for the first two tasks, we consolidated them into a single line:
+
+```python
+tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
+```
 
-Consult the [GitHub repository for `kedro-airflow-k8s`](https://github.com/getindata/kedro-airflow-k8s) for further details, or take a look at the [documentation](https://kedro-airflow-k8s.readthedocs.io/).
+This ensures that the `create-model-input-table-node` task runs only after `preprocess-companies-and-shuttles` has completed.
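+
+The remaining dependencies follow the same pattern. For the full `spaceflights-pandas` pipeline, the dependency section might look like the sketch below (the `train-model-node` and `evaluate-model-node` task names are assumptions based on the tutorial's node names):
+
+```python
+tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
+tasks["create-model-input-table-node"] >> tasks["split-data-node"]
+tasks["split-data-node"] >> tasks["train-model-node"]
+tasks["train-model-node"] >> tasks["evaluate-model-node"]
+```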