josephmachado/modular_code

Code for the blog post "How to turn a 1000-line messy SQL into a modular, & easy-to-maintain data pipeline?"

Setup

Clone the repo and set up a virtual environment:

git clone https://github.com/josephmachado/modular_code.git
cd modular_code
python -m venv ./env
source env/bin/activate
pip install -r requirements
python setup.py

Run ETL

The ETL can be run either as a single SQL query (messy_query.sql) or as the modular Python code in modular_code.py:

duckdb database.db < messy_query.sql
python modular_code.py
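
To illustrate what "modular" means here, the following is a minimal sketch and not the contents of modular_code.py. It assumes a made-up orders dataset and hypothetical function names (filter_completed, total_by_customer, run_pipeline), and uses plain Python lists instead of DuckDB. The idea is that each step that would be a CTE or subquery in one large query becomes a small function that can be read and tested on its own.

```python
from collections import defaultdict

# Hypothetical sample rows; the real pipeline reads from database.db.
ORDERS = [
    {"order_id": 1, "customer_id": "a", "amount": 20.0, "status": "completed"},
    {"order_id": 2, "customer_id": "a", "amount": 5.0, "status": "cancelled"},
    {"order_id": 3, "customer_id": "b", "amount": 12.5, "status": "completed"},
]


def filter_completed(orders):
    """Keep only completed orders (one step of the larger query)."""
    return [o for o in orders if o["status"] == "completed"]


def total_by_customer(orders):
    """Sum order amounts per customer (another step of the larger query)."""
    totals = defaultdict(float)
    for o in orders:
        totals[o["customer_id"]] += o["amount"]
    return dict(totals)


def run_pipeline(orders):
    """Compose the small steps into the full transformation."""
    return total_by_customer(filter_completed(orders))


if __name__ == "__main__":
    print(run_pipeline(ORDERS))
```

Because each step takes and returns plain data, a step can be swapped or tested without running the whole pipeline.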

Compare data

We use datacompy to compare the datasets. Run the comparison with:

python compare_dataset.py

Run tests

Tests are defined in test_modular_code.py. Run them with:

pytest test_modular_code.py
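
For readers new to pytest, here is a rough sketch of the shape such a test can take. It is not taken from test_modular_code.py: the function add_order_bucket and its threshold are hypothetical, chosen only to show how a small transformation function is checked with plain assert statements in functions whose names start with test_.

```python
def add_order_bucket(row):
    """Hypothetical transformation: label an order as large or small by amount."""
    bucket = "large" if row["amount"] >= 100 else "small"
    # Return a new dict so the caller's row is not modified.
    return {**row, "bucket": bucket}


def test_add_order_bucket_labels_large_orders():
    assert add_order_bucket({"amount": 150})["bucket"] == "large"


def test_add_order_bucket_does_not_mutate_input():
    row = {"amount": 10}
    result = add_order_bucket(row)
    assert result["bucket"] == "small"
    assert "bucket" not in row
```

Saved in a file whose name starts with test_, these functions are collected and run by pytest automatically.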