Fully automated Databricks PySpark DLT pipeline for integrating with Kafka Topics & Kafka Schema Registry

dkirrane/kafka-to-databricks-dlt

Overview

This repo provides a solution to stream records from a Kafka topic into an Azure Databricks Delta Live Table (DLT).

The PySpark job reads a multi-schema Kafka topic, fetches the required Avro schema from the Kafka Schema Registry, and deserialises the records into a Databricks Delta Live Table (DLT).
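A minimal sketch of what the DLT notebook might look like — the topic name, subject name, Schema Registry address and header handling below are illustrative assumptions, not the exact code in kafka_topic_dr_pipeline:

```python
# Sketch only: endpoints, subject names and secrets handling are assumptions.
import dlt
from confluent_kafka.schema_registry import SchemaRegistryClient
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import from_avro

# Fetch the writer schema for one of the subjects registered against the
# multi-schema user-actions topic (subject name is an assumption).
sr_client = SchemaRegistryClient({"url": "https://<schema-registry-host>"})
schema_str = sr_client.get_latest_version("user-actions-value").schema.schema_str

@dlt.table(name="user_actions", comment="Records streamed from the user-actions Kafka topic")
def user_actions():
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "<aiven-kafka-host>:<port>")
        .option("subscribe", "user-actions")
        .option("startingOffsets", "earliest")
        .load()
    )
    # Confluent-framed Avro prepends a 5-byte header (magic byte + schema id);
    # strip it before handing the payload to from_avro.
    payload = F.expr("substring(value, 6, length(value) - 5)")
    return raw.select(from_avro(payload, schema_str).alias("record")).select("record.*")
```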

  • terraform folder contains all the Terraform needed to provision the Azure, Aiven & Databricks infrastructure.
  • kafka_topic_dr_pipeline contains the Databricks Asset Bundle (DAB) with the PySpark DLT pipeline.
  • aiven_kafka contains Python scripts to produce to / consume from the Aiven Kafka topic, using Avro for the message schemas and Faker to generate producer data.
  • scripts folder contains scripts for working with Azure Databricks and deploying the Databricks DAB bundle.

CI/CD (GitHub Actions)

Step 1: IaC

Run the Terraform Apply action to provision all the required Azure, Aiven & Databricks infrastructure.

Terraform Apply

Step 2: Kafka Producer

Run the Kafka Producer action to write some records to the user-actions topic. This is a multi-schema topic, so records have different Avro schemas.

Kafka Producer
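For reference, the producer scripts under aiven_kafka follow roughly this pattern — the record schema, field names and connection details below are assumptions, not the repo's exact code:

```python
# Rough sketch of an Avro producer in the spirit of the aiven_kafka scripts;
# schema, topic and endpoints are placeholders/assumptions.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer
from faker import Faker

fake = Faker()

# Hypothetical Avro schema for one of the record types on the topic.
user_action_schema = """
{
  "type": "record",
  "name": "UserAction",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "action", "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

sr_client = SchemaRegistryClient({"url": "https://<schema-registry-host>"})
avro_serializer = AvroSerializer(sr_client, user_action_schema)

producer = SerializingProducer({
    "bootstrap.servers": "<aiven-kafka-host>:<port>",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": avro_serializer,
})

# Produce a handful of fake user-action records.
for _ in range(10):
    producer.produce(
        topic="user-actions",
        key=fake.uuid4(),
        value={
            "user_id": fake.uuid4(),
            "action": fake.random_element(["click", "view", "purchase"]),
            "timestamp": fake.unix_time() * 1000,
        },
    )
producer.flush()
```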

Step 3: Databricks Asset Bundle (DAB)

Run the DAB deploy action to deploy a Databricks Asset Bundle (DAB) containing the Delta Live Tables (DLT) notebook that streams the user-actions topic into Databricks.

DAB deploy

Step 4: Verify

  1. Verify the pipeline is running from the Databricks console. Note: it may take a few minutes for the pipeline to start while it waits for cluster resources.

Navigate to the Azure Databricks workspace > click Launch Workspace > Delta Live Tables > Waiting for active compute resource... > the pipeline should start running: https://portal.azure.com/#@dkirrane/resource/subscriptions/XXXXXXXX/resourceGroups/databricks-kafka-dr-poc-rg/providers/Microsoft.Databricks/workspaces/databricks-kafka-dr-poc-dbw/overview

  2. Check the Streaming Table: kafka_dr_pipeline > click the user_actions Streaming Table > click the Table name > Create compute resource... > kafka_dr_pipeline (table) > Sample Data tab (or run the notebook check sketched below).
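The same check can be run from a notebook cell; the schema and table names below assume the defaults used above and may differ in your deployment:

```python
# Hypothetical notebook check; adjust catalog/schema names to match the
# target configured for the DLT pipeline.
df = spark.table("kafka_dr_pipeline.user_actions")
display(df.limit(10))                         # eyeball a few deserialised records
print(df.count(), "records streamed so far")
```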

Step 5: Cleanup

Run the Terraform Destroy action to delete all PoC infrastructure.

Terraform Destroy

Links

GitHub Actions

https://github.com/dkirrane/kafka_topic_dr/actions

Terraform Cloud workspace

https://app.terraform.io/app/dkirrane/workspaces/databricks-kafka-dr-poc/runs
