Code for the blog post: How to turn a 1000-line messy SQL query into a modular, easy-to-maintain data pipeline?
Clone the repo and set up a virtual environment:
git clone https://github.com/josephmachado/modular_code.git
cd modular_code
python -m venv ./env
source env/bin/activate
pip install -r requirements.txt
python setup.py
The ETL can be run either as a single query or as the modular code in modular_code.py:
duckdb database.db < messy_query.sql
python modular_code.py
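The modular version breaks the monolithic query into small, named steps that can be tested independently. A minimal sketch of that structure, using the standard library's sqlite3 in place of DuckDB for illustration; the table and function names below are hypothetical, not taken from the repo:

```python
import sqlite3


def extract(conn: sqlite3.Connection) -> None:
    # Stand-in for the raw source data; the real pipeline reads existing tables.
    conn.executescript("""
        CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT);
        INSERT INTO orders VALUES (1, 10.0, 'shipped'), (2, 5.5, 'cancelled');
    """)


def transform(conn: sqlite3.Connection) -> None:
    # Each CTE from the messy query becomes its own named, testable step.
    conn.execute("""
        CREATE TABLE shipped_revenue AS
        SELECT order_id, amount FROM orders WHERE status = 'shipped'
    """)


def load(conn: sqlite3.Connection) -> list:
    return conn.execute("SELECT * FROM shipped_revenue").fetchall()


def run_pipeline() -> list:
    conn = sqlite3.connect(":memory:")
    extract(conn)
    transform(conn)
    return load(conn)


print(run_pipeline())  # → [(1, 10.0)]
```

Splitting the SQL this way means each step has one job, a clear input and output, and can be unit-tested on its own.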
We use datacompy to compare the two outputs; run the comparison with:
python compare_dataset.py
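datacompy joins the two datasets on key columns and reports rows that exist in only one side plus value mismatches on shared keys. A sketch of that core idea using plain Python dicts rather than datacompy's DataFrame API; the data and function are hypothetical illustrations:

```python
def compare_rows(left: dict, right: dict) -> dict:
    # Index both datasets by key, then bucket rows the way a datacompy
    # report does: only-in-left, only-in-right, and value mismatches.
    result = {"only_left": [], "only_right": [], "mismatched": []}
    for key in left.keys() - right.keys():
        result["only_left"].append(key)
    for key in right.keys() - left.keys():
        result["only_right"].append(key)
    for key in left.keys() & right.keys():
        if left[key] != right[key]:
            result["mismatched"].append(key)
    return result


messy = {1: {"amount": 10.0}, 2: {"amount": 5.5}}
modular = {1: {"amount": 10.0}, 3: {"amount": 7.0}}
print(compare_rows(messy, modular))
# → {'only_left': [2], 'only_right': [3], 'mismatched': []}
```

An empty report across all three buckets is the signal that the refactor preserved the original query's output.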
We have tests defined in test_modular_code.py, which can be run as shown below:
pytest test_modular_code.py
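A typical test in such a file checks one modular step against a known-good result. A sketch of the pattern; the function under test and the fixture data are hypothetical, not copied from the repo:

```python
# pytest discovers functions named test_*; plain asserts give rich failure output.
def get_shipped_revenue(orders: list) -> float:
    # Hypothetical stand-in for one modular step of the pipeline.
    return sum(o["amount"] for o in orders if o["status"] == "shipped")


def test_get_shipped_revenue():
    orders = [
        {"amount": 10.0, "status": "shipped"},
        {"amount": 5.5, "status": "cancelled"},
    ]
    assert get_shipped_revenue(orders) == 10.0
```

Because each step is a plain function, tests like this run on small in-memory fixtures with no database required.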