Create pipeline to CDC data into Databricks – Etlworks Support

Overview

A Change Data Capture (CDC) pipeline streams real-time changes from a CDC-enabled source database and loads them into Databricks. Etlworks ships a built-in CDC engine based on a customized version of Debezium — nothing to install or run separately, the CDC engine is part of the flow execution.

Supported source databases

MySQL
SQL Server
PostgreSQL
Oracle
DB2
MongoDB
AS400 (IBMi)

When to use this pipeline

Use this pipeline to replicate changes from a CDC-enabled source database into Databricks in near-real time — typically with seconds to single-digit minutes of lag, depending on configured load frequency.

For loads triggered by a schedule rather than by source-database changes, use Any to Databricks with high-watermark change replication. For streams coming from a message queue rather than a database transaction log, use Stream messages from a queue into Databricks.

Flows optimized for Databricks

Flow		Use when
Any to Databricks (Database / File / Queue / Web service / Well-known API)		Extract from any source, optionally transform, and load into Databricks on a schedule.
Bulk load files into Databricks		Files already exist in a stage. No transformation. Auto-generates COPY INTO; supports MERGE.
Stream CDC events into Databricks	You are here	Real-time replication from a CDC-enabled source database.
Stream messages from a queue into Databricks		Real-time ingestion from a streaming-capable message queue.

How does this flow work?

The CDC engine connects to the source database's transaction log and captures inserts, updates, and deletes in real time.
Captured events are serialized into staging files (CSV by default) in cloud storage or a Databricks Volume.
On a configurable interval (default 60 seconds), the flow loads the staged files into Databricks using COPY INTO.
For each destination table, the load uses MERGE semantics to apply inserts, updates, and deletes — honoring the CDC operation flags from the source.
The flow is checkpointed throughout. If streaming or loading fails, the pipeline resumes from the last successful position once the issue is resolved.

Single-flow architecture (recommended)

The default and recommended setup is a single flow that handles both streaming and loading. One flow to configure, schedule, and monitor; failures in either streaming or loading pause the same flow, so recovery is one resume action.

Prerequisites

A CDC-enabled source database. See Database-specific CDC cases for per-database setup.
A Databricks connection (PAT or OAuth Service Principal). See Get started with Databricks.
A stage connection — S3, ADLS Gen2, GCS, server storage, or a Databricks Volume.
An offset-backup connection (typically server storage or cloud storage) for CDC offset and history files.
(Optional) A signal connection for high-watermark coordination across pipeline stages.

Create and schedule the Databricks CDC flow

Step 1. Create a CDC connection for the source

See Change Data Capture (CDC) from transaction log for the connection setup per source database. The CDC connection points the engine at the source database's transaction log and configures the captured tables.

Step 2. Create the stage connection

The stage holds the CSV files between extraction and load. Use:

Amazon S3 for AWS-hosted workspaces.
Azure Storage (ADLS Gen2) for Azure workspaces.
Google Cloud Storage for GCP workspaces.
Server storage pointed at a Databricks Volume (/Volumes/<catalog>/<schema>/<volume>/…) or a directory on the Etlworks host.

Step 3. Create the Databricks connection

See Get started with Databricks for connection setup.

Step 4. Create a connection for offsets and history

The CDC engine persists offsets and schema-history files to a backup connection so the pipeline can resume after a restart. Use server storage or cloud storage. The default location is {app.data}/cdc/<flow-name>/.

Step 5. Create the Databricks CDC flow

In Flows, click + and pick Stream CDC events into Databricks.

Step 6. Add the source-to-destination transformation

Set FROM to the CDC connection and the captured table (or a wildcard). Set TO to the Databricks destination — usually catalog.schema.* for multi-table replication.

Step 7. Set the staging connection on the flow

On the flow, set the named connection Stage to the stage connection from step 2. Set the named connection Offset Backup to the connection from step 4.

Step 8. Configure load parameters

Databricks-specific CDC-flow parameters:

Parameter	What it does
Load data into database (ms)	How often the flow loads staged files into Databricks. Default 60000 (1 minute). Lower values reduce lag but increase Databricks query volume.
Wait (ms) to let running load finish when stream stops	Grace period to drain a load in progress when the flow stops. Default 180000 (3 minutes).
Action	MERGE by default for CDC. Reflects the upsert semantics required to apply inserts, updates, and deletes from the source transaction log.
How to MERGE	DELETE / INSERT (default), Native MERGE, or Native MERGE with separate DELETE. The separate-DELETE variant handles soft-delete markers independently of UPSERTs — useful when CDC sources emit explicit delete events.
Lookup Fields	The CDC primary key column(s). Required for MERGE.
Extra Lookup Fields	Additional columns to include in the MERGE join (for composite keys).
Handle explicit CDC updates	Preserve unmapped columns from the source schema during MERGE (default false).
Create temp tables with only TEXT columns	Staging Delta table uses all STRING columns; engine casts during COPY INTO (default true). Absorbs schema drift in the CDC stream.
Alter target table if source has new columns	Auto-ALTER the destination Delta table when the source schema widens. Required true for production CDC.
Databricks Credential	Optional inline storage credential clause for COPY INTO. Leave blank when the workspace has an External Location or managed Volume covering the stage path.
Line Separator	Override the CSV line ending sent to FORMAT_OPTIONS.

Step 9. Schedule the CDC flow

CDC flows run continuously. Schedule the flow with a continuous-execution schedule so it auto-restarts after restarts or failures.

Monitor a running CDC flow

While a CDC flow runs, the streaming progress, current offset, processed events, and pending load batches are all visible from the flow execution view. See CDC Configuration and Monitoring.

Two-flow alternative: separate extract and load

For very high throughput or different scaling requirements on extraction vs. loading, you can split the pipeline into two flows:

A CDC Extract flow that streams events from the source database to staged CSV files in the stage.
A Bulk load files into Databricks flow scheduled to run periodically against the same stage.

This decouples streaming throughput from Databricks load frequency at the cost of an extra flow to operate. See the equivalent two-flow pattern documented in Create pipeline to CDC data into Snowflake — the Databricks variant follows the same shape.

Troubleshooting

For database-specific CDC setup issues, see Database-specific CDC cases. For common Databricks load issues, see Common issues when loading data into cloud data warehouses. For CDC tips, see Tips and Tricks for CDC Flows.

Articles in this section