Update docs add Airflow KubernetesPodOperator #4529

75 changes: 73 additions & 2 deletions docs/source/deployment/airflow.md

## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster

If you want to execute your DAG in an isolated environment on Airflow using a Kubernetes cluster, you can use a combination of the [`kedro-airflow`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow) and [`kedro-docker`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) plugins.
1. **Package your Kedro project as a Docker container**
   [Use the `kedro docker init` and `kedro docker build` commands](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) to containerise your Kedro project.


2. **Push the Docker image to a container registry**
   Upload the built Docker image to a cloud container registry, such as AWS ECR, Google Container Registry, or Docker Hub.

3. **Generate an Airflow DAG**
   Run the following command to generate an Airflow DAG:
   ```sh
   kedro airflow create
   ```
   This will create a DAG file that includes the `KedroOperator()` by default.

4. **Modify the DAG to use `KubernetesPodOperator`**

   To execute each Kedro node in an isolated Kubernetes pod, replace `KedroOperator()` with `KubernetesPodOperator()`, as shown in the example below:

   ```python
   from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

   KubernetesPodOperator(
       task_id=node_name,
       name=node_name,
       namespace=NAMESPACE,
       image=DOCKER_IMAGE,
       cmds=["kedro"],
       arguments=["run", f"--nodes={node_name}"],
       get_logs=True,
       is_delete_operator_pod=True,  # clean up the pod after execution
       in_cluster=False,
       do_xcom_push=False,
       image_pull_policy="Always",
   )
   ```
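A natural follow-up question is how the operator knows which Kubernetes cluster to target. With `in_cluster=False`, the `cncf.kubernetes` provider reads a kubeconfig file, and the `config_file` and `cluster_context` arguments select the cluster and context (an Airflow connection via `kubernetes_conn_id` is an alternative). A minimal sketch, assuming a local k3d cluster; the kubeconfig path and context name below are hypothetical examples:

```python
# Connection-related keyword arguments for KubernetesPodOperator
# (parameter names from the apache-airflow-providers-cncf-kubernetes package).
cluster_kwargs = {
    "in_cluster": False,  # do not use the pod's own in-cluster service account
    "config_file": "/opt/airflow/.kube/config",  # kubeconfig available to the scheduler/worker
    "cluster_context": "k3d-local",  # which kubeconfig context (cluster) to use
}

# These would be passed alongside the arguments shown above:
# KubernetesPodOperator(task_id=node_name, ..., **cluster_kwargs)
```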

### Running multiple nodes in a single container


By default, this approach runs each node in an isolated Docker container. However, to reduce computational overhead, you can choose to run multiple nodes together within the same container. If you opt for this, you must modify the DAG accordingly to adjust task dependencies and execution order.

For example, in the [`spaceflights-pandas` tutorial](../tutorial/spaceflights_tutorial.md), if you want to execute the first two nodes together, your DAG may look like this:

```python
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(...) as dag:
    tasks = {
        "preprocess-companies-and-shuttles": KubernetesPodOperator(
            task_id="preprocess-companies-and-shuttles",
            name="preprocess-companies-and-shuttles",
            namespace=NAMESPACE,
            image=DOCKER_IMAGE,
            cmds=["kedro"],
            arguments=["run", "--nodes=preprocess-companies-node,preprocess-shuttles-node"],
            ...
        ),
        "create-model-input-table-node": KubernetesPodOperator(...),
        ...
    }

    tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
    tasks["create-model-input-table-node"] >> tasks["split-data-node"]
    ...
```

In this example, we modified the original DAG generated by the `kedro airflow create` command by replacing `KedroOperator()` with `KubernetesPodOperator()`. Additionally, we merged the first two tasks into a single task named `preprocess-companies-and-shuttles`. This task executes the Docker image running two Kedro nodes: `preprocess-companies-node` and `preprocess-shuttles-node`.
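If you maintain several merged tasks, one way to keep the task definitions and their `--nodes` arguments consistent is to derive both from a single mapping of task names to Kedro node names. A minimal sketch; the mapping and helper below are illustrative, not something `kedro airflow create` generates:

```python
# Map each Airflow task to the Kedro nodes it should run (illustrative names).
node_groups = {
    "preprocess-companies-and-shuttles": [
        "preprocess-companies-node",
        "preprocess-shuttles-node",
    ],
    "create-model-input-table-node": ["create-model-input-table-node"],
}


def run_arguments(nodes):
    """Build the container arguments: `kedro run --nodes=<comma-separated list>`."""
    return ["run", f"--nodes={','.join(nodes)}"]


# Each entry would then become one operator:
#   KubernetesPodOperator(task_id=task, ..., cmds=["kedro"], arguments=run_arguments(nodes))
for task, nodes in node_groups.items():
    print(task, run_arguments(nodes))
```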


Furthermore, we adjusted the task order at the end of the DAG. Instead of having separate dependencies for the first two tasks, we consolidated them into a single line:

```python
tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
```

This ensures that the `create-model-input-table-node` task runs only after `preprocess-companies-and-shuttles` has completed.
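For readers less familiar with Airflow, `>>` records a downstream dependency and returns its right-hand operand, which is what makes chained lines like the one above work. A tiny stand-in class (not Airflow's implementation) demonstrates the behaviour:

```python
class Task:
    """Minimal stand-in for an Airflow task, only to illustrate `>>` chaining."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        self.downstream.append(other.task_id)
        return other  # returning the right operand is what allows a >> b >> c


pre = Task("preprocess-companies-and-shuttles")
table = Task("create-model-input-table-node")
split = Task("split-data-node")

pre >> table >> split  # same as: pre >> table, then table >> split
```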
