# CityBrainClimateData

## Introduction

This project develops data pipelines to streamline extracting CESM data from AWS S3, transforming it, and loading it into Citybrain.

The data pipeline includes the following steps:

  1. Download data from AWS: Use a bash script to download data from the specified S3 URI.
  2. Save parameters in JSON: Save relevant parameters in a JSON file.
  3. Data transformation: Convert the downloaded data from Zarr to Parquet and perform data transformations (a rough sketch of steps 2-3 follows this list).
  4. Create table in Citybrain: Create a table in Citybrain using the Parquet files and the parameters in the JSON file.
  5. Quality Assurance (QA): Download sample data from Citybrain to check that the data pipeline has functioned correctly and to assess data quality.
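
The following is a rough sketch of how steps 2 and 3 might look in Python, assuming the Zarr store from step 1 has already been downloaded locally and that xarray, zarr, pandas, and pyarrow are available; the file paths, parameter keys, and the variable name QBOT are illustrative only, not fixed by this repository.

```python
import json

import xarray as xr

# Step 2 (sketch): save the parameters that later tasks will need in a JSON file.
params = {
    "s3_uri": "s3://ncar-cesm-lens/atm/daily/cesmLE-RCP85-QBOT.zarr",
    "variable": "QBOT",                                   # illustrative variable name
    "parquet_path": "output/cesmLE-RCP85-QBOT.parquet",   # illustrative output path
}
with open("params.json", "w") as f:
    json.dump(params, f, indent=2)

# Step 3 (sketch): open the downloaded Zarr store and write it out as Parquet.
# A full CESM store is large, so the real pipeline would process it in chunks.
ds = xr.open_zarr("data/cesmLE-RCP85-QBOT.zarr")          # illustrative local path
df = ds[params["variable"]].to_dataframe().reset_index()
df.to_parquet(params["parquet_path"], index=False)
```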

For example, the CESM1 data pipeline: see the pipeline diagram (plot) in this repository.

## How to run the data pipeline

In this project, Apache Airflow DAGs are used to orchestrate the workflow.
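
As an illustration only, a DAG for this kind of pipeline could be wired roughly as follows; the task names, placeholder commands, and helper callables are assumptions for the sketch, not the repository's actual code. The actual DAG files (e.g. cesm1/cesm1-dag.py) are the authoritative source.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Default S3 URI; step 1 of "To run a data pipeline" below replaces this value.
S3_URI = "s3://ncar-cesm-lens/atm/daily/cesmLE-RCP85-QBOT.zarr"

def transform_zarr_to_parquet():
    """Placeholder for the Zarr -> Parquet transformation (step 3)."""
    ...

def create_citybrain_table():
    """Placeholder for creating the table in Citybrain (step 4)."""
    ...

with DAG(
    dag_id="cesm1_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,   # run on demand via `airflow dags trigger`
    catchup=False,
) as dag:
    download = BashOperator(
        task_id="download_from_s3",
        # Placeholder command; the project uses a bash script to download from the S3 URI.
        bash_command=f'echo "download {S3_URI} here"',
    )
    transform = PythonOperator(
        task_id="zarr_to_parquet",
        python_callable=transform_zarr_to_parquet,
    )
    load = PythonOperator(
        task_id="create_citybrain_table",
        python_callable=create_citybrain_table,
    )

    # Task dependencies: download, then transform, then load into Citybrain.
    download >> transform >> load
```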

Before running a data pipeline, make sure Apache Airflow is installed and the DAG files are placed in your Airflow DAGs folder so the scheduler can discover them.

To run a data pipeline (DAG):

  1. In the DAG file (xx_dag_xx.py), replace the default S3 URI with the S3 URI of the target data. For example, in cesm1/cesm1-dag.py, change 's3://ncar-cesm-lens/atm/daily/cesmLE-RCP85-QBOT.zarr' to 's3://ncar-cesm-lens/atm/daily/cesmLE-20C-FLNS.zarr'.

  2. Trigger the DAG from the Airflow UI or the command line. For example, to trigger the CESM1 DAG (cesm1_dag.py), run airflow dags trigger cesm1_dag. The DAG will execute its tasks according to the defined task dependencies.

  3. Monitor the progress and status of each task from the Airflow UI or the command line. (A sketch of triggering and monitoring a run programmatically follows this list.)
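
For completeness, steps 2 and 3 can also be done programmatically through the Airflow 2 stable REST API. This is a sketch under assumptions: it requires the REST API and a basic-auth backend to be enabled on your Airflow webserver, and the URL and credentials below are placeholders for your deployment.

```python
import time

import requests

AIRFLOW_URL = "http://localhost:8080/api/v1"   # placeholder webserver address
AUTH = ("admin", "admin")                       # placeholder basic-auth credentials

# Trigger a run of cesm1_dag (equivalent to `airflow dags trigger cesm1_dag`).
resp = requests.post(f"{AIRFLOW_URL}/dags/cesm1_dag/dagRuns", json={}, auth=AUTH)
resp.raise_for_status()
run_id = resp.json()["dag_run_id"]

# Poll the run state until the DAG finishes.
while True:
    run = requests.get(f"{AIRFLOW_URL}/dags/cesm1_dag/dagRuns/{run_id}", auth=AUTH).json()
    print("DAG run state:", run["state"])
    if run["state"] in ("success", "failed"):
        break
    time.sleep(30)
```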