The current implementation was developed on macOS but is intended to work with any platform supported by Docker. In our experience, Linux and macOS are fine. You can run it on native Windows 10 using [WSL](https://docs.microsoft.com/en-us/windows/wsl/about). Unfortunately, Docker on Windows 10 (version 1809) is hamstrung because it relies on Windows File Sharing (CIFS) to establish the volume mounts. Airflow hammers the volume a little harder than CIFS can handle, and you'll see intermittent FileNotFound errors in the volume mount. This may improve in the future. For now, running _whirl_ inside a Linux VM in Hyper-V gives more reliable results.
### Airflow Versions
As of January 2021, Whirl uses Airflow 2.x.x as the default version. A specific tag was made for Airflow 1.10.x, which can be found [here](https://github.com/godatadriven/whirl/tree/airflow-1.10.x).
## Getting Started
If you want to stop all containers from a specific environment, you can add the `-e` or `--environment` command-line argument with the name of the environment. This name corresponds to a directory in the `envs` directory.
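As a sketch, stopping only that environment's containers could look like this (assuming the `stop` subcommand, and using the `api-python-s3` environment from [`envs`](./envs) as an example):

```bash
# stop only the containers that belong to the 'api-python-s3' environment
$ whirl stop -e api-python-s3
```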
#### Usage in a CI Pipeline
We run most of the examples from within our own CI (GitHub Actions); see our [GitHub workflow](.github/workflows/whirl-ci.yml) for implementation details.
You can also run an example in `ci` mode on your local system by using the `whirl ci` command. This will:
- run the Docker containers daemonized in the background;
- ensure the DAG(s) are unpaused; and
- wait for the pipeline to either succeed or fail.
Upon success, the containers are stopped and the command exits successfully.
In case of failure (or success, if failure is expected), we print the logs of the failed task and clean up before indicating that the pipeline has failed.
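As a sketch, running the API-to-S3 example in `ci` mode from a local checkout looks roughly like this (the example's own `.whirl.env` selects the environment):

```bash
$ cd ./examples/api-to-s3
# runs daemonized, unpauses the DAG(s) and waits for the pipeline to succeed or fail
$ whirl ci
```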
#### Configuring Environment Variables
Instead of using the environment option each time you run _whirl_, you can also configure your environment in a `.whirl.env` file. This file can live in three places, which are applied in order (see the sketch after this list):
- A `.whirl.env` file in the root of this repository. This can also specify a default environment to be used when starting _whirl_. You do this by setting `WHIRL_ENVIRONMENT`, which references a directory in the [`envs`](./envs) folder. This repository contains an example you can modify. It specifies the default `PYTHON_VERSION` to be used in any environment.
- A `.whirl.env` file in your [`envs/{your-env}`](./envs) subdirectory. The environment directory to use can be set by any of the other `.whirl.env` files or specified on the command line. This is helpful for setting environment-specific variables. Of course it doesn't make much sense to set the `WHIRL_ENVIRONMENT` here.
- A `.whirl.env` in your DAG directory to override any environment variables. This can be useful, for example, to override the (default) `WHIRL_ENVIRONMENT`.
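As a minimal sketch, a root-level `.whirl.env` could look like this (the values are illustrative; `api-python-s3` is one of the environments in [`envs`](./envs), and the Python version shown is just an example):

```bash
# .whirl.env in the repository root
WHIRL_ENVIRONMENT=api-python-s3   # default environment, a directory under envs/
PYTHON_VERSION=3.8                # illustrative value
```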
This repository contains some example environments and workflows. The components used might serve as a starting point for your own environment. If you have a good example you'd like to add, please submit a merge request!
Each example contains its own README file explaining the specifics of that example.
#### Generic running of examples
From within the example directory, the `whirl` command can be executed.
To run an example:
```bash
$ cd ./examples/<example-dag-directory>
# Note: here we pass the whirl environment as a command-line argument. It can also be configured with the WHIRL_ENVIRONMENT variable
$ whirl -e <environment to use>
```
Open your browser to [http://localhost:5000](http://localhost:5000) to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.
#### SFTPOperator + PythonOperator + MySQL Example

This example includes containers for:

- An SFTP server;
- A MySQL instance;
- The core Airflow component.

The environment contains two startup scripts in the `whirl.setup.d/` folder:

- `01_prepare_sftp.sh`, which adds an SFTP connection to Airflow;
- `02_prepare_mysql.sh`, which adds a MySQL connection to Airflow.

To run this example:

```bash
$ cd ./examples/sftp-mysql-example
$ whirl
```

Open your browser to [http://localhost:5000](http://localhost:5000) to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.

The environment to be used is set in the `.whirl.env` in the DAG directory. In the environment folder there is also a `.whirl.env` which specifies how `MOCK_DATA_FOLDER` is set. The DAG folder also contains a `whirl.setup.d/` directory with the script `01_cp_mock_data_to_sftp.sh`. This script is executed in the container after the environment-specific scripts have run and does two things:

1. It renames the file `mocked-data-#ds_nodash#.csv` in the `./mock-data/` folder, replacing `#ds_nodash#` with the same value that Apache Airflow uses when templating `ds_nodash` in the Python files. This means we have a file available for our specific DAG run. (The logic to rename these files is located in `/etc/airflow/functions/date_replacement.sh` in the Airflow container.)
2. It copies this file to the SFTP server, where the DAG expects to find it. When the DAG starts, it tries to copy that file from the SFTP server to the local filesystem.
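As an illustration of the renaming step (the date below is hypothetical; the real value comes from the DAG run being executed):

```bash
# For a DAG run where ds_nodash renders to 20210101, the preparation script
# effectively performs a rename like this before copying the file to the SFTP server:
mv ./mock-data/mocked-data-#ds_nodash#.csv ./mock-data/mocked-data-20210101.csv
```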
#### Rest API to S3 Storage Example

In this example we are going to:

1. Consume a REST API;
2. Convert the JSON data to Parquet;
3. Store the result in an S3 bucket.

The default environment (`api-python-s3`) includes containers for:

- An S3 server;
- A MockServer instance;
- The core Airflow component.

The environment contains a setup script in the `whirl.setup.d/` folder:

- `01_add_connection_api.sh`, which:
  - Adds an S3 connection to Airflow;
  - Installs the `awscli` Python libraries and configures them to connect to the S3 server;
  - Creates a bucket (with a `/etc/hosts` entry to support the [virtual host style method](https://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html)).

It is also possible to use a more complex environment (`api-python-s3-k8s`) that adds a Kubernetes cluster and uses Airflow's KubernetesExecutor to run this example. This environment is explained in depth in the [environment README](../../envs/api-python-s3-k8s/README.md).

To run this example with the default environment:

```bash
$ cd ./examples/api-to-s3
$ whirl
```

To run this example with the k8s-based environment:

```bash
$ cd ./examples/api-to-s3
$ whirl -e api-python-s3-k8s
```

Open your browser to [http://localhost:5000](http://localhost:5000) to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.

This example includes a `.whirl.env` configuration file in the DAG directory. In the environment folder there is also a `.whirl.env` which specifies S3-specific variables. The example folder also contains a `whirl.setup.d/` directory with an initialization script (`01_add_connection_api_and_mockdata.sh`). This script is executed in the container after the environment-specific scripts have run and will:

- Add a connection to the API endpoint;
- Add an [expectation](http://www.mock-server.com/mock_server/creating_expectations.html) for the MockServer to know which response needs to be sent for which requested path;
- Install Pandas and PyArrow to support transforming the JSON into a Parquet file;
- Create a local directory where the intermediate file is stored before being uploaded to S3.

The DAG contains two tasks, both using the PythonOperator, to:

- call the mock API to retrieve the JSON data and convert it to a local Parquet file;
- use the S3Hook to copy the local file to the S3 bucket.
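If you want to verify the upload yourself, a hypothetical check from a shell inside the Airflow container could look like this (the setup scripts have already configured `awscli` for the local S3 server; the actual bucket name comes from the environment configuration):

```bash
# list the buckets on the local S3 server and inspect the uploaded Parquet file
$ aws s3 ls
$ aws s3 ls s3://<your-bucket>/
```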
#### Having an external database for Airflow

In this example the DAG is not the most important part. This example is all about how to configure Airflow to use an external database.
We have created an environment (`external-airflow-db`) that spins up a PostgreSQL database server together with the Airflow one.

To run the corresponding example DAG, perform the following (assuming you have put _whirl_ on your `PATH`):

```bash
$ cd ./examples/external-airflow-db
$ whirl
```

Open your browser to [http://localhost:5000](http://localhost:5000) to see the Airflow UI appear. Manually enable the DAG and see the pipeline get marked as successful.

The environment to be used is set in the `.whirl.env` in the DAG directory. In the environment folder there is also a `.whirl.env` which specifies Postgres-specific variables.
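Pointing Airflow at an external metadata database boils down to overriding its connection string. A hedged sketch of the kind of variable such an environment sets (host, credentials and database name are assumptions, not copied from this repository):

```bash
# hypothetical values; the real ones live in the environment's .whirl.env and setup scripts
export AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgresdb:5432/airflow
```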
#### Testing failure email

In this example the DAG is set up to fail. This example is all about how to configure Airflow to use an external SMTP server for sending the failure emails.
We have created an environment that spins up an SMTP server together with the Airflow one.

To run the corresponding example DAG, perform the following (assuming you have put _whirl_ on your `PATH`):

```bash
$ cd ./examples/external-smtp-for-failure-emails
$ whirl
```

Open your browser to [http://localhost:5000](http://localhost:5000) to see the Airflow UI appear. Manually enable the DAG and see the pipeline get marked as failed.
Also open your browser at [http://localhost:1080](http://localhost:1080) for the email client where the emails should show up.

The environment to be used is set in the `.whirl.env` in the DAG directory. In the environment folder there is also a `.whirl.env` which specifies specific Airflow configuration variables.
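Those configuration variables are the usual Airflow SMTP overrides. A hedged sketch of what they might look like (hostname, port and sender address are assumptions, not copied from this repository):

```bash
# hypothetical values pointing Airflow at the local SMTP container
export AIRFLOW__SMTP__SMTP_HOST=smtp-server
export AIRFLOW__SMTP__SMTP_PORT=1025
export AIRFLOW__SMTP__SMTP_MAIL_FROM=airflow@example.com
```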
#### SSH to Localhost

The directory `examples/localhost-ssh-example/` contains the Airflow DAG, and the environment to be used is configured inside the `.whirl.env` file (it uses the `local-ssh` environment).

The `local-ssh` environment only involves one component, the Apache Airflow Docker container itself. The environment contains one preparation script called `01_enable_local_ssh.sh`, which makes it possible to SSH to `localhost` inside that container. The script also adds a new connection called `ssh_local` to the Airflow connections.

The DAG has one task that simply uses the SSHOperator to copy a file. A successful run proves that we are able to use an Airflow connection to execute commands over SSH.

To run this example:

```bash
$ cd ./examples/localhost-ssh-example
# Note: here we pass the whirl environment 'local-ssh' as a command-line argument.
$ whirl -e local-ssh
```

Open your browser to [http://localhost:5000](http://localhost:5000) to access the Airflow UI. Manually enable the DAG and watch the pipeline run to successful completion.
#### Remote logging for Airflow

In this example the DAG is not the most important part. This example is all about how to configure Airflow to log to S3.

The environment to be used (`airflow-s3-logging`) is set in the `.whirl.env` in the DAG directory. In the environment folder there is also a `.whirl.env` which specifies S3-specific variables.

The Docker Compose file of this environment spins up an S3 server together with the Airflow one. The environment contains setup scripts in the `whirl.setup.d` folder:

- `01_add_connection_s3.sh`, which:
  - adds an S3 connection to Airflow;
  - installs the awscli Python libraries and configures them to connect to the S3 server;
  - creates a bucket (adding a `/etc/hosts` entry to support the [virtual host style method](https://docs.aws.amazon.com/AmazonS3/latest/dev/VirtualHosting.html)).
- `02_configue_logging_to_s3.sh`, which:
  - exports environment variables which Airflow uses to override the default config, for example `export AIRFLOW__CORE__REMOTE_LOGGING=True`; see the sketch after this list.
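A hedged sketch of the kind of overrides such a script exports (the log bucket path and connection id are assumptions, not copied from the actual script):

```bash
# hypothetical values; the real script points at the bucket and S3 connection created above
export AIRFLOW__CORE__REMOTE_LOGGING=True
export AIRFLOW__CORE__REMOTE_BASE_LOG_FOLDER=s3://<your-log-bucket>/logs
export AIRFLOW__CORE__REMOTE_LOG_CONN_ID=<your-s3-connection-id>
```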
To run the corresponding example DAG, perform the following (assuming you have put _whirl_ on your `PATH`):

```bash
$ cd ./examples/logging-to-s3
$ whirl
```

Open your browser to [http://localhost:5000](http://localhost:5000) to see the Airflow UI appear. Manually enable the DAG and see the pipeline get marked as successful. If you open one of the logs, the first line shows that the log is retrieved from S3.