Comparison of technologies for performing ETL
dataset: OS Open UPRN https://osdatahub.os.uk/downloads/open/OpenUPRN
full count: 41,011,955 records; test count: 2,000,000 records
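How the 2,000,000-record test set is cut from the full download is not stated here. A minimal sketch, assuming the download is a CSV file with a header row and that the test set is simply the first rows of that file (the function name and the file names in the usage note are hypothetical):

```python
import csv
import itertools
from pathlib import Path


def write_test_subset(source: Path, dest: Path, n_rows: int = 2_000_000) -> int:
    """Copy the header and the first n_rows data rows from source to dest.

    Returns the number of data rows written. Rows are streamed one at a
    time, so the full file is never held in memory.
    """
    written = 0
    with source.open(newline="") as src, dest.open("w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty source file: nothing to copy
        writer.writerow(header)
        # islice stops after n_rows rows without reading the rest of the file.
        for row in itertools.islice(reader, n_rows):
            writer.writerow(row)
            written += 1
    return written
```

For example, `write_test_subset(Path("full.csv"), Path("test.csv"))` would write the header plus up to 2,000,000 rows. Taking the first rows is not a random sample, so the subset may not be representative if the file is ordered.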
run the initial upload:
cd data
sh initial_upload.sh
origin database: postgres, target database: target
create the target database:
createdb target
list the available docker images:
sudo docker images
make the memory logging script executable:
chmod +x log_memory.sh
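The contents of log_memory.sh are not shown here. Separately from that script, peak memory of a Python-based ETL step can also be read in-process. A minimal sketch using the standard library `resource` module (Unix only); the helper names `peak_rss_mib` and `run_and_report` are hypothetical and not part of this repo:

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(step, *args, **kwargs):
    """Run one ETL step and return (result, peak RSS in MiB after the step)."""
    result = step(*args, **kwargs)
    return result, peak_rss_mib()
```

The value is a high-water mark for the whole process lifetime, so each tool should be run in a fresh process if the numbers are to be compared.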
- DuckDB is the winner for both small and large datasets.
- It is not distributed, so it might struggle with very large datasets.
- It is mostly SQL-based, which is familiar to many but might have limitations.
- Spark handles memory well for both small and large datasets.
- It is not as fast as DuckDB.
- It is distributed, so it can handle very large datasets (terabytes).
- It supports SQL, Python, and Scala.
- It also has machine learning and graph processing capabilities.
- Very efficient compared to pandas and, for small datasets, competes well against Spark.
- Very similar to pandas.