- Starter
- Business
- Enterprise
- On-Premise
- Add-on
IMPORTANT: Etlworks connections and flows optimized for Azure Synapse Analytics are fully compatible with Microsoft Fabric Warehouse, although some configuration steps differ slightly.
Overview
Change Data Capture (CDC) pipeline allows extracting changes in real-time from the databases which support CDC and loading them into Synapse Analytics or Microsoft Fabric Warehouse.
Etlworks includes a built-in, deeply integrated CDC engine based on a customized version of Debezium.
There is nothing to install or manage separately—the CDC engine runs as part of the flow execution.
Supported source databases
- MySQL
- SQL Server
- PostgreSQL
- Oracle
- DB2
- MongoDB
- AS400 (IBMI platfroms)
When to use this pipeline
Use the pipeline described in this article to extract data from the CDC-enabled database and load it into Azure Synapse Analytics or Microsoft Fabric Warehousein real time.
Flows optimized for Synapse Analytics
| Flow type | When to use | |
|
Any to Synapse Analytics or Fabric Warehouse:
|
When you need to extract data from any source, transform it and load it into Synapse Analytics or Microsoft Fabric Warehouse. | |
| Bulk load files in Azure Blob into Synapse Analytics or Fabric Warehouse | When you need to bulk-load files that already exist in Azure Blob without applying any transformations. The flow automatically generates the COPY command and MERGEs data into the destination. | |
| Stream CDC events into Synapse Analytics or Fabric Warehouse | You are here | When you need to stream updates from the database which supports Change Data Capture (CDC) into Synapse Analytics or Microsoft Fabric Warehouse in real-time. |
| Stream messages from a queue into Synapse Analytics or Fabric Warehouse | When you need to stream messages from the message queue which supports streaming into Synapse Analytics or Microsoft Fabric Warehouse in real time. |
How it works
A CDC pipeline into Synapse/Fabric Warehouse captures real-time changes from a CDC-enabled source database, stages them in Zrure Blob storage and loads them using Synapse’s native ingestion mechanisms.
You can build the pipeline using two options, depending on your reliability, performance, and scaling needs:
Option 1: Single Flow – Stream and Load Together
Use this option for simplicity and quick setup.
-
A single flow streams CDC events into Azure Blob storage and periodically loads them into target tables using native Synapse bulk load (as frequently as every second).
-
Only one flow to configure, schedule, and monitor.
-
The flow is fully fault-tolerant: if streaming or loading fails, the entire process resumes from the last successful checkpoint once the issue is resolved.
-
Best for: low to moderate data volumes and teams prioritizing simplicity.
-
Consideration: Since streaming and loading are tightly coupled, a failure in one pauses both.
Option 2: Separate Extract and Load Flows
Use this option for maximum throughput, fault isolation, and horizontal scaling.
-
Build two independent flows:
-
Extract Flow: streams CDC events into Azure Blob storage.
-
Load Flow: loads staged data into Synapse using native bulk operations.
-
-
The flows run in parallel and can be deployed across multiple nodes for large-scale workloads.
-
Best for: high-volume, low-latency pipelines and mission-critical systems where stability and performance are essential.
Prerequisites
Common Prerequisites
1. Enable CDC in the source database:
- Enable CDC for Microsoft SQL Server
- Enable CDC MySQL
- Enable CDC for Oracle
- Enable CDC for PostgreSQL
- Enable CDC for DB2
- Enable CDC for MongoDB
- Enable CDC for AS400 (IBMI platfroms)
Prerequisites for Synapse Analytics
1. The Azure Synapse Analytics dedicated SQL pool is up and running.
2. The Azure Storage blob is created.
Prerequisites for Microsoft Fabric Warehouse
1. The Microsoft Fabric Warehouse exists and available from the Internet.
2. The Azure Storage Gen2 account exists, user has read permissions for the specific container and account linked to PowerBI and Fabric workspace.
A pipeline with a single flow
Synapse Analytics CDC flow streams CDC events into the designated Azure Blob in real-time and periodically (as often as every second) loads the data into Synapse Analytics in parallel with the stream.
Note: There is no need to create a separate Flow for the initial load. The first time it connects to a CDC-enabled source database, it reads a consistent snapshot of all of the included databases and tables. When that snapshot is complete, the Flow continuously reads the changes that were committed to the transaction log and generates the corresponding insert, update, and delete events.
Read more about CDC in Etlworks.
Step 1. Create a CDC Connection for the source.
Read how to create a CDC connection.
Step 2. Configure format
Enable the following properties:
- Escape double-quotes
- Convert null to empty
Step 3. Configure storage location and type.
The CDC connection must be configured to create files in the Azure Storage from which the Synapse instance can load data files. Read how to configure the CDC connection to create files in Azure Storage.
Step 3. Create connection for staging files in Azure Blob Storage
This flow loads staged files from Azure Blob Storage. Read how to create Azure Blob Storage connection.
Step 4. Create an Azure Synapse Analytics connection for the destination.
Step 5. Create a connection for history and offset files.
Read how to create CDC Offset and History connection.
Etlworks CDC connectors store the history of DDL changes for the monitored database in the history file and the current position in the transaction log in the offset file.
Typical CDC extract flow starts by snapshotting the monitored tables (A) or starts from the oldest known position in the transaction (redo) log (B), then proceeds to stream changes in the source database (C). If the Flow is stopped and restarted, it resumes from the last recorded position in the transaction log. The connection created in this step can be used to reset the CDC pipeline and restart the process from scratch.
The connection, by default, points to the directory{app.data}/debezium_data.
Step 6. Create Synapse Analytics CDC flow.
In Flows click Add flow. Type in cdcin Select Flow type. Select Stream CDC events into Azure Synapse Analytics.
Step 7. Add source-to-destination transformation
Left to right:
- select the CDC connection created
- select tables to monitor in FROM
- select the Synapse Analytics connection
- and select or enter the Synapse Analytics table name in TO. When streaming data from multiple source tables set the destination table using a wildcard template in the following format:database.schema.prefix_*_suffix, where database.schemais a Synapse database and schema to load data into.
Step 8. Set connection for staging files in Azure Blob Storage.
Select Connections tab, select connection created in step 3 and select CSV format.
Step 9. Configure load parameters
Click theMAPPINGbutton, select theParameterstab.
If needed modify the following Load parameters:
- Load data into Synapse Analytics every (ms): by default, the flow loads data into Synapse every 5 minutes (300000 milliseconds). The load runs in parallel with the CDC stream, which never stops. Decrease this parameter to load data into Synapse more often or increase it to reduce the number of consumed Synapse credits.
- Wait (ms) to let running load finish when CDC stream stops: By default, the flow loads data into Synapse every 5 minutes. The CDC stream and load are running in parallel, so when streaming stops, the flow executes the load last more time to finish loading the remaining data in the queue. It is possible that the load flow is still running when the stream stops. Use this parameter to configure how long the flow should wait before executing the load last time. Clear this parameter to disable the wait. In this case, if the load task is still running, the flow will finish without executing the load one last time. The flow will load the remaining data in the queue on the next run.
- Action: the action can beMERGE(default) orINSERT. If the action is set toMERGEthe flow will INSERT records that do not exist in the destination table, UPDATE existing records, and DELETE records that were deleted in the source table.
- Lookup Fields:MERGEaction requires a list of columns that uniquely identify the record. By default, the flow will attempt to predict the Lookup Fields by checking unique indexes in the source and destination tables, but if there is no unique index in either table it is not guaranteed that the prediction will be 100% accurate. Use this parameter to define the Lookup Fields in the following format:fully.qualified.table1=field1,field2;fully.qualified.table2=field1,field2.
Note: The other parameters are similar or the same as for the flow type Bulk load files in Azure Storage into Synapse Analytics.
Step 10. Schedule Synapse Analytics CDC flow.
We recommend using a continuous run Schedule type. The idea is that the Flow runs until it is stopped manually, there is an error, or (if configured) there are no more new CDC events for an extended period of time. It restarts automatically after a configurable number of seconds.
Monitor running CDC flow
Read how to monitor running CDC flow.
A pipeline with independent extract and load flows
CDC extract flow extracts data from a CDC-enabled database and creates CSV files with CDC events in the configured location. These files are loaded into the target database by Load Flow.
Step-by-Step Guide
Step 1. Create and schedule CDC Extract flow
This flow is used to extract CDC events from the source database in read time and create files with events.
Read how to create CDC extract flow.
Step 2. Create and schedule Bulk Load flow
This Flow is used to bulk load files created by the CDC extract flow into Synapse/Fabric Warehouse.
Read how to create bulk load flow.
Step 3. Schedule Bulk Load Flow.
Schedule flow to run as often as needed. These are the options:
- Run flow periodically (as often as once a minute)
- Run flow continuously (as often as once a second).