Code for the blog post: How to turn a 1000-line messy SQL query into a modular, easy-to-maintain data pipeline?
Clone the repo and set up a virtual environment:
git clone https://github.com/josephmachado/modular_code.git
cd modular_code
python -m venv ./env
source env/bin/activate
pip install -r requirements.txt
python setup.py
The ETL can be run either as a single query or as the modular code in modular_code.py:
duckdb database.db < messy_query.sql
python modular_code.py
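The modular version breaks the monolithic query into small, named steps that can be tested independently. A minimal sketch of that structure, using the standard library's sqlite3 in place of DuckDB for illustration; the table and function names below are hypothetical, not taken from the repo:

```python
import sqlite3


def extract(conn: sqlite3.Connection) -> None:
    # Stand-in for the raw source data; the real pipeline reads existing tables.
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT);
        INSERT INTO orders VALUES (1, 10.0, 'shipped'), (2, 5.5, 'cancelled');
    """)


def transform(conn: sqlite3.Connection) -> None:
    # Each CTE from the messy query becomes its own named, testable step.
    conn.execute("""
        CREATE TABLE shipped_revenue AS
        SELECT order_id, amount FROM orders WHERE status = 'shipped'
    """)


def load(conn: sqlite3.Connection) -> list:
    return conn.execute("SELECT * FROM shipped_revenue").fetchall()


def run_pipeline() -> list:
    conn = sqlite3.connect(":memory:")
    extract(conn)
    transform(conn)
    return load(conn)


print(run_pipeline())  # → [(1, 10.0)]
```

Splitting the SQL this way means each step has one job, a clear input and output, and can be unit-tested on its own.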
We use datacompy to compare the two outputs; run the comparison with:
python compare_dataset.py
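datacompy joins the two datasets on key columns and reports rows that exist in only one side plus value mismatches on shared keys. A sketch of that core idea using plain Python dicts rather than datacompy's DataFrame API; the data and function are hypothetical illustrations:

```python
def compare_rows(left: dict, right: dict) -> dict:
    # Index both datasets by key, then bucket rows the way a datacompy
    # report does: only-in-left, only-in-right, and value mismatches.
    result = {"only_left": [], "only_right": [], "mismatched": []}
    for key in left.keys() - right.keys():
        result["only_left"].append(key)
    for key in right.keys() - left.keys():
        result["only_right"].append(key)
    for key in left.keys() & right.keys():
        if left[key] != right[key]:
            result["mismatched"].append(key)
    return result


messy = {1: {"amount": 10.0}, 2: {"amount": 5.5}}
modular = {1: {"amount": 10.0}, 3: {"amount": 7.0}}
print(compare_rows(messy, modular))
# → {'only_left': [2], 'only_right': [3], 'mismatched': []}
```

An empty report across all three buckets is the signal that the refactor preserved the original query's output.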
We have tests defined in test_modular_code.py, which can be run as shown below:
pytest test_modular_code.py
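A typical test in such a file checks one modular step against a known-good result. A sketch of the pattern; the function under test and the fixture data are hypothetical, not copied from the repo:

```python
# pytest discovers functions named test_*; plain asserts give rich failure output.
def get_shipped_revenue(orders: list) -> float:
    # Hypothetical stand-in for one modular step of the pipeline.
    return sum(o["amount"] for o in orders if o["status"] == "shipped")


def test_get_shipped_revenue():
    orders = [
        {"amount": 10.0, "status": "shipped"},
        {"amount": 5.5, "status": "cancelled"},
    ]
    assert get_shipped_revenue(orders) == 10.0
```

Because each step is a plain function, tests like this run on small in-memory fixtures with no database required.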