This project extracts drug adverse event data from the FDA's open data platform (openFDA), transforms it with Python and Pandas, and loads it into Snowflake for further analysis and visualization. The entire ETL process is orchestrated with Apache Airflow and Docker, and the data is visualized in Tableau Desktop.
- Project Overview
- Architecture Diagram
- Tools and Services Used
- API Details
- Data Extraction
- Data Storage
- Data Modeling
- ETL Workflow
- Setup Instructions
- Running the Project
- Data Visualization with Tableau Desktop
- Contribution
This project involves:
- Constructing an API query to retrieve data from the FDA's open data platform.
- Building a Python script to fetch and process the data.
- Storing the data in AWS S3.
- Modeling and storing the transformed data in Snowflake.
- Orchestrating the entire ETL process using Apache Airflow.
- Visualizing the data using Tableau Desktop.
- Python: For scripting and data processing.
- Pandas: For data manipulation and analysis.
- AWS S3: As a data lake for storing raw and transformed data.
- Snowflake: For data modeling and storage.
- Apache Airflow: For orchestrating the ETL workflow.
- Docker: For containerizing the Airflow environment.
- Visual Studio Code: As the development environment.
- Tableau Desktop: For data visualization.
- ODBC Snowflake Driver: For connecting Tableau to Snowflake.
- Base URL: https://api.fda.gov/drug/event.json
- Search Query:
`receivedate:[20200101 TO 20201231] AND occurcountry:"US" AND _missing_:companynumb`
This query retrieves adverse event reports received in 2020 for cases occurring in the United States, limited to reports submitted directly by consumers (those with no company number).
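For reference, below is a minimal sketch of issuing this query with the `requests` library; only the base URL and search string come from the project itself, and the page size shown is an illustrative value.

```python
import requests

BASE_URL = "https://api.fda.gov/drug/event.json"

# openFDA query: US reports received in 2020 with no company number (consumer reports)
params = {
    "search": 'receivedate:[20200101 TO 20201231] AND occurcountry:"US" AND _missing_:companynumb',
    "limit": 100,  # illustrative page size; the API caps this at 1,000
}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()
payload = response.json()

print(payload["meta"]["results"]["total"])  # total number of matching reports
records = payload["results"]                # list of adverse event reports
```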
- Build a Python script to fetch 80,000 rows from the API. Implement paging, since the API can return at most 1,000 rows per request (see the sketch below).
- Use Pandas to extract only the required columns.
- Handle missing values and update fields to match the data shown on the FDA page.
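A minimal sketch of the paging and cleaning logic, assuming the query above; the selected columns and fill values are illustrative rather than the project's exact selection, and a very large extract may need to be split (e.g., by date range), since openFDA limits how far the `skip` offset can go.

```python
import pandas as pd
import requests

BASE_URL = "https://api.fda.gov/drug/event.json"
SEARCH = 'receivedate:[20200101 TO 20201231] AND occurcountry:"US" AND _missing_:companynumb'
PAGE_SIZE = 1000          # API maximum per request
TARGET_ROWS = 80_000      # total rows the project aims to collect


def fetch_pages():
    """Yield batches of reports, paging with the skip/limit parameters."""
    skip = 0
    while skip < TARGET_ROWS:
        resp = requests.get(
            BASE_URL,
            params={"search": SEARCH, "limit": PAGE_SIZE, "skip": skip},
            timeout=30,
        )
        resp.raise_for_status()
        results = resp.json().get("results", [])
        if not results:
            break
        yield results
        skip += PAGE_SIZE


# Flatten the nested JSON, keep only the columns of interest, and handle nulls
# (the column names and fill values here are illustrative).
frames = [pd.json_normalize(batch) for batch in fetch_pages()]
df = pd.concat(frames, ignore_index=True)
df = df[["safetyreportid", "receivedate", "occurcountry", "serious"]]
df = df.dropna(subset=["safetyreportid"]).fillna({"serious": 0})
```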
- Use AWS S3 as a data lake.
- Create an S3 bucket with different folders for raw and cleaned data.
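A minimal sketch of writing the raw and cleaned files to the two folders with `boto3`; the bucket name, file names, and key prefixes are placeholders.

```python
import boto3

s3 = boto3.client("s3")  # credentials come from the environment or an AWS profile

BUCKET = "fda-drug-events-data-lake"  # placeholder bucket name

# Raw API output goes to one prefix, the Pandas-cleaned output to another
s3.upload_file("drug_events_raw.json", BUCKET, "raw/drug_events_raw.json")
s3.upload_file("drug_events_clean.csv", BUCKET, "cleaned/drug_events_clean.csv")
```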
- Create SQL worksheets in Snowflake to design the database schema.
- Create new schemas, databases, and tables to handle the transformed data from S3.
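An illustrative sketch of the kind of DDL that might be run from a Snowflake worksheet (here issued through the Snowflake Python connector); the connection details, database, schema, table, and column names are placeholders rather than the project's actual model.

```python
import snowflake.connector

# Placeholder connection details; replace with your own account settings
conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    role="SYSADMIN",
)

ddl_statements = [
    "CREATE DATABASE IF NOT EXISTS FDA_EVENTS",
    "CREATE SCHEMA IF NOT EXISTS FDA_EVENTS.ADVERSE_EVENTS",
    """
    CREATE TABLE IF NOT EXISTS FDA_EVENTS.ADVERSE_EVENTS.DRUG_EVENTS (
        SAFETYREPORTID  STRING,
        RECEIVEDATE     DATE,
        OCCURCOUNTRY    STRING,
        SERIOUS         NUMBER
    )
    """,
]

cur = conn.cursor()
for stmt in ddl_statements:
    cur.execute(stmt)
cur.close()
conn.close()
```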
Create an Airflow DAG script to orchestrate the ETL process with the following tasks (a condensed sketch appears after the list):
- Check API Availability: Ensure the API is available.
- Extract Data: Fetch FDA drug adverse event data from the API, load it into S3, and push the S3 path to XCom for downstream reference.
- Transform Data: Pull the S3 path from XCom, read the raw data from S3, transform and normalize it, and load the transformed data into a different S3 folder.
- Create Snowflake Stage: Set up a Snowflake stage for the transformed data.
- Load Data into Snowflake: Copy the transformed data into Snowflake tables using the stage from Task 4.
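A condensed sketch of what such a DAG might look like; the task IDs, connection IDs, S3 paths, stage definition, and SQL are assumptions for illustration, not the project's exact code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.http.sensors.http import HttpSensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator


def extract_data(**context):
    """Fetch the FDA data, write the raw file to S3, and share the path via XCom."""
    raw_s3_path = "s3://<bucket>/raw/drug_events_raw.json"  # placeholder path
    # ... call the openFDA API and upload the raw JSON to S3 here ...
    context["ti"].xcom_push(key="raw_s3_path", value=raw_s3_path)


def transform_data(**context):
    """Pull the raw path from XCom, clean it with Pandas, and write it back to S3."""
    raw_s3_path = context["ti"].xcom_pull(task_ids="extract_data", key="raw_s3_path")
    # ... download, transform/normalize, and upload to the cleaned/ prefix here ...


with DAG(
    dag_id="fda_drug_events_etl",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    check_api = HttpSensor(
        task_id="check_api_availability",
        http_conn_id="fda_api",  # HTTP connection created in the Airflow UI
        endpoint="drug/event.json",
    )
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    transform = PythonOperator(task_id="transform_data", python_callable=transform_data)
    create_stage = SnowflakeOperator(
        task_id="create_snowflake_stage",
        snowflake_conn_id="snowflake_default",
        sql="CREATE STAGE IF NOT EXISTS FDA_EVENTS.ADVERSE_EVENTS.S3_STAGE "
            "URL='s3://<bucket>/cleaned/'",
    )
    load_data = SnowflakeOperator(
        task_id="load_into_snowflake",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO FDA_EVENTS.ADVERSE_EVENTS.DRUG_EVENTS "
            "FROM @FDA_EVENTS.ADVERSE_EVENTS.S3_STAGE "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)",
    )

    check_api >> extract >> transform >> create_stage >> load_data
```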
- Install Astro CLI: A tool provided by Astronomer to help manage and interact with Airflow environments, allowing you to run Airflow locally in Docker containers.
- Install Docker Desktop: Update the `.wslconfig` file to use the maximum CPU and RAM possible (a minimum of 6 processors and 6 GB of RAM).
- Update Airflow Configurations: Modify the `.env` and `requirements.txt` files as shown in the how_to_run.docx file.
- Start Docker Desktop: Initialize the Astro environment using Astro commands in Visual Studio Code, as detailed in the how_to_run.docx file.
- Run Airflow in Docker: Use Astro commands as detailed in the how_to_run.docx file.
- Access Airflow: Navigate to http://localhost:8080.
- Navigate to Airflow Connections: Create HTTP, AWS, and Snowflake connections.
- Trigger the Airflow DAG.
Once the data is loaded into Snowflake:
- Install the ODBC Snowflake Driver: Required for Tableau Desktop to connect to Snowflake.
- Create an Extract Connection: Connect Tableau Desktop to Snowflake.
- Data Modeling in Tableau: Create a star schema and define relationships between the cleaned data tables.
- Create Visualizations: Use Tableau Desktop to process and visualize the data.
Check out the Tableau data visualization at 2020_Drug_adverse_events | Tableau Public.
Check out the how_to_run.docx file for detailed steps to run this project.
Check out S3 URI.txt to get the URIs for the raw and transformed data.
Contributions are welcome! Please open an issue or submit a pull request with any improvements.