Update docs: add Airflow KubernetesPodOperator #4529
## How to run a Kedro pipeline on Apache Airflow using a Kubernetes cluster
If you want to execute your DAG in an isolated environment on Airflow using a Kubernetes cluster, you can use a combination of the [`kedro-airflow`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-airflow) and [`kedro-docker`](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) plugins.
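
Both plugins are published on PyPI ([`kedro-airflow`](https://pypi.org/project/kedro-airflow/), [`kedro-docker`](https://pypi.org/project/kedro-docker/)) and can be installed with `pip`:

```sh
pip install kedro-airflow kedro-docker
```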
1. **Package your Kedro project as a Docker container**

    [Use the `kedro docker init` and `kedro docker build` commands](https://github.com/kedro-org/kedro-plugins/tree/main/kedro-docker) to containerize your Kedro project.
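
    For example, a minimal sketch (the image name here is illustrative):

    ```sh
    # Generate a Dockerfile and .dockerignore for the project
    kedro docker init

    # Build the image; use a name that matches your registry
    kedro docker build --image <your-registry>/<project-name>:latest
    ```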

2. **Push the Docker image to a container registry**

    Upload the built Docker image to a cloud container registry, such as AWS ECR, Google Container Registry, or Docker Hub.
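
    With the image built under a registry-qualified name, the push itself is a standard Docker command (the path below is a placeholder):

    ```sh
    docker push <your-registry>/<project-name>:latest
    ```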
3. **Generate an Airflow DAG**

    Run the following command to generate an Airflow DAG:

    ```sh
    kedro airflow create
    ```

    This will create a DAG file that includes the `KedroOperator()` by default.
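
    The command also accepts options for the output location and the Kedro configuration environment, per the `kedro-airflow` plugin; for example:

    ```sh
    # Write the generated DAG into dags/, targeting the "airflow" environment
    kedro airflow create --target-dir=dags/ --env=airflow
    ```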

4. **Modify the DAG to use `KubernetesPodOperator`**

    To execute each Kedro node in an isolated Kubernetes pod, replace `KedroOperator()` with `KubernetesPodOperator()`, as shown in the example below:

    ```python
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    KubernetesPodOperator(
        task_id=node_name,
        name=node_name,  # Name given to the pod itself
        namespace=NAMESPACE,
        image=DOCKER_IMAGE,
        cmds=["kedro"],
        arguments=["run", f"--nodes={node_name}"],
        get_logs=True,
        is_delete_operator_pod=True,  # Cleanup after execution
        in_cluster=False,  # Set to True if the Airflow scheduler itself runs inside the cluster
        do_xcom_push=False,
        image_pull_policy="Always",
    )
    ```
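
    The `NAMESPACE` and `DOCKER_IMAGE` constants are assumed to be defined elsewhere in the generated DAG file; a minimal sketch with placeholder values:

    ```python
    # Placeholder values: point these at your cluster namespace and pushed image
    NAMESPACE = "default"
    DOCKER_IMAGE = "<your-registry>/<project-name>:latest"
    ```

    Note that which Kubernetes cluster the pod is launched on is determined by the operator's connection settings (such as the `kubernetes_conn_id` or `config_file` arguments of `KubernetesPodOperator`), not by anything shown above.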

### Running multiple nodes in a single container

By default, this approach runs each node in an isolated Docker container. However, to reduce computational overhead, you can choose to run multiple nodes together within the same container. If you opt for this, you must modify the DAG accordingly to adjust task dependencies and execution order.

For example, in the [`spaceflights-pandas` tutorial](../tutorial/spaceflights_tutorial.md), if you want to execute the first two nodes together, your DAG may look like this:

```python
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(...) as dag:
    tasks = {
        "preprocess-companies-and-shuttles": KubernetesPodOperator(
            task_id="preprocess-companies-and-shuttles",
            name="preprocess-companies-and-shuttles",
            namespace=NAMESPACE,
            image=DOCKER_IMAGE,
            cmds=["kedro"],
            arguments=["run", "--nodes=preprocess-companies-node,preprocess-shuttles-node"],
            ...
        ),
        "create-model-input-table-node": KubernetesPodOperator(...),
        ...
    }

    tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
    tasks["create-model-input-table-node"] >> tasks["split-data-node"]
    ...
```

In this example, we modified the original DAG generated by the `kedro airflow create` command by replacing `KedroOperator()` with `KubernetesPodOperator()`. Additionally, we merged the first two tasks into a single task named `preprocess-companies-and-shuttles`. This task executes the Docker image running two Kedro nodes: `preprocess-companies-node` and `preprocess-shuttles-node`.

Furthermore, we adjusted the task order at the end of the DAG. Instead of having separate dependencies for the first two tasks, we consolidated them into a single line:

```python
tasks["preprocess-companies-and-shuttles"] >> tasks["create-model-input-table-node"]
```

This ensures that the `create-model-input-table-node` task runs only after `preprocess-companies-and-shuttles` has completed.