Welcome to my Roadmap repository! This repository showcases a comprehensive collection of projects that document my learning journey in data engineering. Each folder represents a specific area of study, featuring a variety of project types, including mini-projects, guided projects, hobby projects, and industry projects. This roadmap serves as both a learning tracker and a portfolio to highlight my growing skills and expertise.
This repository is structured to reflect my learning path in data engineering. Each project demonstrates practical applications of the concepts I have learned, organized into dedicated files for easy navigation. By showcasing these projects, I aim to provide a clear and structured overview of my technical skills and development.
The Understanding Data Engineering.md file contains key theoretical concepts and definitions from the DataCamp "Understanding Data Engineering" course. It serves as a reference guide for important topics and terminologies in the field of data engineering.
- Airflow: Open-source workflow management for scheduling data engineering tasks.
- AWS (Amazon Web Services): Amazon's cloud computing services.
- Azure: Microsoft's cloud services.
- Big Data: Large and complex datasets characterized by volume, variety, velocity, veracity, and value.
- Cloud Computing: Utilizing remote servers hosted on the internet for data management and processing.
- Database Schema: The logical structure of a database, including its data organization and relationships.
- Data Engineering: The process of designing, constructing, and managing data systems to facilitate analysis.
- Data Ingestion: The process of importing data into a system or database.
- Data Lake: A storage repository that holds large amounts of raw data.
- Data Pipelines: A set of processes for moving and transforming data.
- Data Warehousing: Centralized storage of data from multiple sources for analysis.
- ETL (Extract, Transform, Load): A process that extracts data from one source, transforms it, and loads it into a target system.
- Google Cloud: Cloud services provided by Google.
- NoSQL: Non-relational databases for storing structured, semi-structured, and unstructured data.
- Parallel Processing: The simultaneous use of multiple compute resources to process data.
- Redshift: Amazon's cloud data warehouse service.
- S3: Amazon’s cloud object storage service.
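The ETL entry above can be illustrated with a minimal sketch. It is not taken from the course material: the sample CSV, the `scores` table, and the `extract`, `transform`, and `load` function names are all hypothetical, and an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,score
alice,90
bob,
carol,75
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing score, title-case names, cast scores to int."""
    return [
        (row["name"].strip().title(), int(row["score"]))
        for row in rows
        if row["score"].strip()
    ]


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes each one easy to test and swap out on its own.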
The files in this section (Stored Procedure.sql, Student Tables and Views.sql) include projects from my Introduction to SQL coursework, focusing on concepts like:
- Stored Procedures: Demonstrated in the Stored Procedure.sql file.
- Creating Views: Showcased in the Student Tables and Views.sql file.
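As a rough illustration of the view concept, here is a small sketch run through Python's built-in `sqlite3` module. The `students` table, its rows, and the `honor_roll` view are hypothetical and are not the contents of the files above. SQLite has no stored procedures, so only the view is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade INTEGER);
    INSERT INTO students VALUES (1, 'Ana', 92), (2, 'Ben', 68), (3, 'Cy', 85);

    -- A view is a saved query; it stores no rows of its own.
    CREATE VIEW honor_roll AS
        SELECT name, grade FROM students WHERE grade >= 85;
""")

honor_roll = conn.execute(
    "SELECT name, grade FROM honor_roll ORDER BY grade DESC"
).fetchall()

# Rows added to the base table show up in the view automatically.
conn.execute("INSERT INTO students VALUES (4, 'Di', 99)")
updated = conn.execute(
    "SELECT name FROM honor_roll ORDER BY grade DESC"
).fetchall()
```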
This section contains five mini-projects and one guided project that apply various intermediate SQL concepts, including:
- Group By, Order By, Aggregation Functions, Joins, and more.
- Analyzing Student's Mental Health: This guided project uses various SQL functions (GROUP BY, AVG, COUNT) to analyze student data.
- Analyze International Debt's Statistics: Focuses on using SQL to summarize and analyze debt statistics using GROUP BY, SUM, and other essential SQL functions.
- Exploring London’s Travel Network: A guided project that demonstrates the use of aggregation and filtering functions (SUM, GROUP BY, LIMIT).
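The aggregation pattern these projects rely on can be sketched as follows. The `students` table, its columns, and the sample values are made up for illustration and are not the project datasets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, program TEXT, stress_score REAL)"
)
conn.executemany(
    "INSERT INTO students (program, stress_score) VALUES (?, ?)",
    [("Science", 4.0), ("Science", 6.0), ("Arts", 3.0), ("Arts", 5.0), ("Law", 8.0)],
)

# Group rows by program, aggregate each group, then keep the two
# programs with the highest average score.
summary = conn.execute(
    """
    SELECT program, COUNT(*) AS n_students, AVG(stress_score) AS avg_stress
    FROM students
    GROUP BY program
    ORDER BY avg_stress DESC
    LIMIT 2
    """
).fetchall()
print(summary)
```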
Projects in this section demonstrate practical applications of SQL joins, including:
- Inner Joins, Left Joins, Right Joins, Full Joins, and Cross Joins.
Additional projects cover Set Theory operations (UNION, INTERSECT, EXCEPT) and Subqueries.
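A small sketch of these ideas, using hypothetical `customers` and `orders` tables rather than the project data. Right and full joins are left out because older SQLite versions do not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 20.0), (12, 2, 70.0);
""")

# INNER JOIN keeps only customers that have at least one order.
inner = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN keeps every customer; COUNT(o.id) ignores the NULLs of unmatched rows.
left = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

# EXCEPT returns customer ids that never appear in orders.
no_orders = conn.execute("""
    SELECT id FROM customers
    EXCEPT
    SELECT customer_id FROM orders
""").fetchall()

# A scalar subquery compares each order against the overall average amount.
above_avg = conn.execute("""
    SELECT id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY id
""").fetchall()
```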
These projects focus on relational database concepts, including:
- Data Migration: A project that demonstrates migrating data using INSERT INTO and CREATE TABLE.
- Attribute Constraints: Managing data integrity through constraints like NOT NULL, UNIQUE, and foreign keys.
- Many-to-Many Relationships: Demonstrating relational schema designs using surrogate keys and junction tables.
- Referential Integrity: Managing referential integrity with ON UPDATE and ON DELETE behaviors.
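The constraint and referential-integrity ideas above can be sketched with a hypothetical schema: `students` and `courses` linked through an `enrollments` junction table. The schema and data are illustrative assumptions, not the project files. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # foreign keys are off by default in SQLite
conn.executescript("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,        -- surrogate key
        email      TEXT NOT NULL UNIQUE
    );
    CREATE TABLE courses (
        course_id INTEGER PRIMARY KEY,         -- surrogate key
        title     TEXT NOT NULL
    );
    -- The junction table models the many-to-many relationship.
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL
            REFERENCES students(student_id) ON DELETE CASCADE ON UPDATE CASCADE,
        course_id  INTEGER NOT NULL
            REFERENCES courses(course_id) ON DELETE RESTRICT,
        PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO students VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO courses VALUES (100, 'SQL Basics');
    INSERT INTO enrollments VALUES (1, 100), (2, 100);
""")

# UNIQUE rejects a second student with the same email.
try:
    conn.execute("INSERT INTO students VALUES (3, 'a@example.com')")
    unique_violation = False
except sqlite3.IntegrityError:
    unique_violation = True

# ON DELETE CASCADE removes the deleted student's enrollments too.
conn.execute("DELETE FROM students WHERE student_id = 1")
remaining = conn.execute(
    "SELECT student_id, course_id FROM enrollments"
).fetchall()

# ON DELETE RESTRICT blocks deleting a course that still has enrollments.
try:
    conn.execute("DELETE FROM courses WHERE course_id = 100")
    restrict_violation = False
except sqlite3.IntegrityError:
    restrict_violation = True
```

Choosing CASCADE on the student side and RESTRICT on the course side is only one possible policy; the right behavior depends on which deletions should be allowed to propagate.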
This section covers advanced database design principles, including normalization, schema design, and best practices for creating scalable data systems.
This directory includes sub-folders for coding challenges in Python and SQL on platforms such as HackerRank and LeetCode. They broaden my knowledge in areas such as algorithmic problem solving, data analysis, and database management.
This directory contains a collection of Data Pipeline scripts developed in Python as part of my learning journey in the Python for Data Engineering course on Coursera, which I completed through a financial aid opportunity. I plan to continue adding scripts here to build and refine my practical data engineering skills.
This directory contains Python CLI scripts that I build in my free time from ideas I come up with. They also apply what I have learned while studying Python for Data Engineering.
Feel free to reach out to me for any questions or opportunities:
- Email: christianbacani581@gmail.com
- LinkedIn: Click Here
- Portfolio: Click Here
This repository serves as a reflection of my learning journey in data engineering. As I continue to learn and grow, I will update this repository with new projects and insights. Thank you for visiting!