Overview
A Change Data Capture (CDC) pipeline streams real-time changes from a CDC-enabled source database and loads them into Databricks. Etlworks ships a built-in CDC engine based on a customized version of Debezium — nothing to install or run separately, the CDC engine is part of the flow execution.
Supported source databases
- MySQL
- SQL Server
- PostgreSQL
- Oracle
- DB2
- MongoDB
- AS400 (IBMi)
When to use this pipeline
Use this pipeline to replicate changes from a CDC-enabled source database into Databricks in near-real time — typically with seconds to single-digit minutes of lag, depending on configured load frequency.
For loads triggered by a schedule rather than by source-database changes, use Any to Databricks with high-watermark change replication. For streams coming from a message queue rather than a database transaction log, use Stream messages from a queue into Databricks.
Flows optimized for Databricks
| Flow | Use when | |
|---|---|---|
| Any to Databricks (Database / File / Queue / Web service / Well-known API) | Extract from any source, optionally transform, and load into Databricks on a schedule. | |
| Bulk load files into Databricks | Files already exist in a stage. No transformation. Auto-generates COPY INTO; supports MERGE. | |
| Stream CDC events into Databricks | You are here | Real-time replication from a CDC-enabled source database. |
| Stream messages from a queue into Databricks | Real-time ingestion from a streaming-capable message queue. |
How does this flow work?
- The CDC engine connects to the source database's transaction log and captures inserts, updates, and deletes in real time.
- Captured events are serialized into staging files (CSV by default) in cloud storage or a Databricks Volume.
- On a configurable interval (default 60 seconds), the flow loads the staged files into Databricks using COPY INTO.
- For each destination table, the load uses MERGE semantics to apply inserts, updates, and deletes — honoring the CDC operation flags from the source.
- The flow is checkpointed throughout. If streaming or loading fails, the pipeline resumes from the last successful position once the issue is resolved.
Single-flow architecture (recommended)
The default and recommended setup is a single flow that handles both streaming and loading. One flow to configure, schedule, and monitor; failures in either streaming or loading pause the same flow, so recovery is one resume action.
Prerequisites
- A CDC-enabled source database. See Database-specific CDC cases for per-database setup.
- A Databricks connection (PAT or OAuth Service Principal). See Get started with Databricks.
- A stage connection — S3, ADLS Gen2, GCS, server storage, or a Databricks Volume.
- An offset-backup connection (typically server storage or cloud storage) for CDC offset and history files.
- (Optional) A signal connection for high-watermark coordination across pipeline stages.
Create and schedule the Databricks CDC flow
Step 1. Create a CDC connection for the source
See Change Data Capture (CDC) from transaction log for the connection setup per source database. The CDC connection points the engine at the source database's transaction log and configures the captured tables.
Step 2. Create the stage connection
The stage holds the CSV files between extraction and load. Use:
- Amazon S3 for AWS-hosted workspaces.
- Azure Storage (ADLS Gen2) for Azure workspaces.
- Google Cloud Storage for GCP workspaces.
- Server storage pointed at a Databricks Volume (/Volumes/<catalog>/<schema>/<volume>/…) or a directory on the Etlworks host.
Step 3. Create the Databricks connection
See Get started with Databricks for connection setup.
Step 4. Create a connection for offsets and history
The CDC engine persists offsets and schema-history files to a backup connection so the pipeline can resume after a restart. Use server storage or cloud storage. The default location is {app.data}/cdc/<flow-name>/.
Step 5. Create the Databricks CDC flow
In Flows, click + and pick Stream CDC events into Databricks.
Step 6. Add the source-to-destination transformation
Set FROM to the CDC connection and the captured table (or a wildcard). Set TO to the Databricks destination — usually catalog.schema.* for multi-table replication.
Step 7. Set the staging connection on the flow
On the flow, set the named connection Stage to the stage connection from step 2. Set the named connection Offset Backup to the connection from step 4.
Step 8. Configure load parameters
Databricks-specific CDC-flow parameters:
| Parameter | What it does |
|---|---|
| Load data into database (ms) | How often the flow loads staged files into Databricks. Default 60000 (1 minute). Lower values reduce lag but increase Databricks query volume. |
| Wait (ms) to let running load finish when stream stops | Grace period to drain a load in progress when the flow stops. Default 180000 (3 minutes). |
| Action | MERGE by default for CDC. Reflects the upsert semantics required to apply inserts, updates, and deletes from the source transaction log. |
| How to MERGE | DELETE / INSERT (default), Native MERGE, or Native MERGE with separate DELETE. The separate-DELETE variant handles soft-delete markers independently of UPSERTs — useful when CDC sources emit explicit delete events. |
| Lookup Fields | The CDC primary key column(s). Required for MERGE. |
| Extra Lookup Fields | Additional columns to include in the MERGE join (for composite keys). |
| Handle explicit CDC updates | Preserve unmapped columns from the source schema during MERGE (default false). |
| Create temp tables with only TEXT columns | Staging Delta table uses all STRING columns; engine casts during COPY INTO (default true). Absorbs schema drift in the CDC stream. |
| Alter target table if source has new columns | Auto-ALTER the destination Delta table when the source schema widens. Required true for production CDC. |
| Databricks Credential | Optional inline storage credential clause for COPY INTO. Leave blank when the workspace has an External Location or managed Volume covering the stage path. |
| Line Separator | Override the CSV line ending sent to FORMAT_OPTIONS. |
Step 9. Schedule the CDC flow
CDC flows run continuously. Schedule the flow with a continuous-execution schedule so it auto-restarts after restarts or failures.
Monitor a running CDC flow
While a CDC flow runs, the streaming progress, current offset, processed events, and pending load batches are all visible from the flow execution view. See CDC Configuration and Monitoring.
Two-flow alternative: separate extract and load
For very high throughput or different scaling requirements on extraction vs. loading, you can split the pipeline into two flows:
- A CDC Extract flow that streams events from the source database to staged CSV files in the stage.
- A Bulk load files into Databricks flow scheduled to run periodically against the same stage.
This decouples streaming throughput from Databricks load frequency at the cost of an extra flow to operate. See the equivalent two-flow pattern documented in Create pipeline to CDC data into Snowflake — the Databricks variant follows the same shape.
Troubleshooting
For database-specific CDC setup issues, see Database-specific CDC cases. For common Databricks load issues, see Common issues when loading data into cloud data warehouses. For CDC tips, see Tips and Tricks for CDC Flows.