This repo provides a solution for streaming records from a Kafka topic to an Azure Databricks Delta Live Table (DLT). The PySpark job reads a multi-schema Kafka topic, fetches the required Avro schema from the Kafka Schema Registry, and deserialises the records into a Databricks Delta Live Table (DLT).
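At a high level the notebook follows the usual Structured Streaming plus `from_avro` pattern. The sketch below is only an illustration of that pattern for a single event type, not the actual notebook in `kafka_topic_dr_pipeline`; the hosts, subject name and secret handling are placeholders, and because the topic is multi-schema the real job has to resolve the writer schema per record rather than pin one subject as shown here.

```python
# Minimal sketch of the DLT notebook's core logic (illustrative only).
import dlt
from confluent_kafka.schema_registry import SchemaRegistryClient
from pyspark.sql import functions as F
from pyspark.sql.avro.functions import from_avro

# Placeholder connection details -- supply real values via pipeline configuration/secrets.
KAFKA_BOOTSTRAP = "<aiven-kafka-host>:<port>"
SCHEMA_REGISTRY_URL = "https://<aiven-schema-registry>"  # basic-auth settings omitted
SUBJECT = "user-actions-value"  # hypothetical subject name

# Fetch the latest Avro schema registered for the topic's value subject.
registry = SchemaRegistryClient({"url": SCHEMA_REGISTRY_URL})
value_schema = registry.get_latest_version(SUBJECT).schema.schema_str


@dlt.table(name="user_actions", comment="Records streamed from the user-actions topic")
def user_actions():
    # `spark` is provided by the Databricks/DLT runtime.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP)  # SASL_SSL options omitted
        .option("subscribe", "user-actions")
        .option("startingOffsets", "earliest")
        .load()
    )
    # The Confluent wire format prefixes each value with a magic byte plus a 4-byte
    # schema id; strip those 5 bytes before handing the payload to from_avro.
    payload = F.expr("substring(value, 6, length(value) - 5)")
    return (
        raw.select(from_avro(payload, value_schema).alias("record"), "timestamp")
        .select("record.*", "timestamp")
    )
```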
- `terraform` folder contains all the Terraform needed to provision the Azure, Aiven & Databricks infrastructure.
- `kafka_topic_dr_pipeline` is the Databricks DAB (PySpark).
- `aiven_kafka` contains Python scripts to consume/produce to/from the Aiven Kafka topic, using Avro for the message schema & Faker for the producer.
- `scripts` folder contains scripts for working with Azure Databricks and deploying the Databricks DAB bundle.
Run the `Terraform Apply` action to provision all the required Azure, Aiven & Databricks infrastructure.
Run the `Kafka Producer` action to write some records to the `user-actions` topic. This is a multi-schema topic, so records have different Avro schemas.
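For reference, a producer along these lines might look like the following. This is only a rough sketch, not the actual script in `aiven_kafka`: the Avro schema, field names and connection settings are invented for illustration, and SASL/TLS and Schema Registry auth are omitted.

```python
# Rough sketch of an Avro producer that uses Faker to generate fake payloads.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer
from faker import Faker

# Hypothetical Avro schema for one of the topic's record types.
LOGIN_SCHEMA = """
{
  "type": "record",
  "name": "UserLogin",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "user_name", "type": "string"},
    {"name": "logged_in_at", "type": "string"}
  ]
}
"""

fake = Faker()
registry = SchemaRegistryClient({"url": "https://<aiven-schema-registry>"})  # auth omitted

producer = SerializingProducer({
    "bootstrap.servers": "<aiven-kafka-host>:<port>",  # SASL_SSL settings omitted
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(registry, LOGIN_SCHEMA),
})

# Produce a handful of fake login events to the user-actions topic.
for _ in range(10):
    record = {
        "user_id": fake.uuid4(),
        "user_name": fake.user_name(),
        "logged_in_at": fake.iso8601(),
    }
    producer.produce(topic="user-actions", key=record["user_id"], value=record)

producer.flush()
```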
Run the `DAB deploy` action to deploy a Databricks Asset Bundle (DAB) that contains a Delta Live Tables (DLT) notebook streaming the `user-actions` topic to Databricks.
1. Verify the pipeline is running from the Databricks console. Note it may take a few minutes for the pipeline to start as it waits for cluster resources.
   - Navigate to the Azure Databricks workspace > click `Launch Workspace` > `Delta Live Tables` > `Waiting for active compute resources...` > the pipeline should start running
   - https://portal.azure.com/#@dkirrane/resource/subscriptions/XXXXXXXX/resourceGroups/databricks-kafka-dr-poc-rg/providers/Microsoft.Databricks/workspaces/databricks-kafka-dr-poc-dbw/overview
2. Check the Streaming Table (or query it directly, as in the sketch after this list).
   - `kafka_dr_pipeline` > click the `user_actions` Streaming Table > click the Table name > `Create compute resource...` > `kafka_dr_pipeline` (table) > `Sample Data` tab
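If you prefer not to click through the UI, the table can also be sampled from any notebook attached to a running cluster. The target schema below (`kafka_dr_pipeline`) is an assumption; the actual name depends on the target configured in the pipeline settings.

```python
# Quick sanity check from a Databricks notebook. The schema/table path is an
# assumption and depends on the target configured for the DLT pipeline.
df = spark.sql("SELECT * FROM kafka_dr_pipeline.user_actions LIMIT 10")
df.show(truncate=False)
print(f"Rows ingested so far: {spark.table('kafka_dr_pipeline.user_actions').count()}")
```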
Run the `Terraform Destroy` action to delete all PoC infrastructure.
- GitHub Actions: https://github.com/dkirrane/kafka_topic_dr/actions
- Terraform Cloud runs: https://app.terraform.io/app/dkirrane/workspaces/databricks-kafka-dr-poc/runs