- Starter
- Business
- Enterprise
- On-Premise
- Add-on
Overview
Change Data Capture (CDC) pipeline allows extracting changes in real-time from the databases which support CDC and loading them into the relational database which supports the bulk load.
Etlworks includes a built-in, deeply integrated CDC engine based on a customized version of Debezium.
There is nothing to install or manage separately—the CDC engine runs as part of the flow execution.
Supported source databases
- MySQL
- SQL Server
- PostgreSQL
- Oracle
- DB2
- MongoDB
- AS400 (IBMI platfroms)
When to use this pipeline
Use the pipeline described in this article to extract data from the CDC-enabled database and load it into the destination database which supports the bulk load in real-time.
How it works
A CDC with Bulk Load pipeline in Etlworks captures changes from a CDC-enabled source database, stages them in cloud or local storage, and then loads them into a destination database using its native bulk load mechanism. This approach provides maximum performance for high-volume pipelines.
You can build this pipeline in two ways, depending on your reliability needs and scalability goals:
Option 1: Single Flow – Stream and Load Together
Use this option if you prefer simplicity and are handling moderate change volumes.
-
A single flow performs two actions:
-
Streams CDC events into a designated staging area (e.g., S3, Azure Blob).
-
Periodically (as frequently as every second) bulk loads data into the destination database.
-
-
The process is fault-tolerant: if a failure occurs, the flow resumes both streaming and loading from the last successful checkpoint.
-
Best for: simple deployments with minimal setup.
-
Caution: if streaming fails, loading stops (and vice versa), since both are tightly coupled.
Option 2: Separate Extract and Load Flows
Use this option if you need higher throughput, greater fault isolation, and scalability.
-
The pipeline consists of:
-
Extract Flow: streams CDC events into the staging area.
-
Load Flow: performs bulk loads from the staging area into the destination.
-
-
These flows operate independently and in parallel, which improves fault tolerance and allows:
-
Isolated error recovery
-
Horizontal scaling with multiple flow instances across nodes
-
-
Best for: large-scale deployments, high-frequency CDC, and environments requiring high availability.
Prerequisites
1. You must enable CDC in the source database:
- Enable CDC for Microsoft SQL Server
- Enable CDC MySQL
- Enable CDC for Oracle
- Enable CDC for PostgreSQL
- Enable CDC for DB2
- Enable CDC for MongoDB
- Enable CDC for AS400 (IBMI platfroms)
2. The destination database must support bulk load operations.
Below are some of the examples for the most commonly used databases:
- SQL Server BULK INSERT statement
- RDS SQL Server BULK INSERT statement using Amazon S3 integration
- Postgres COPY command
- RDS Postgres COPY command using Amazon S3 integration
- MySQL LOAD DATA INFILE statement
- Amazon Aurora MySQL LOAD DATA INFILE statement using Amazon S3 integration
- Oracle Inline External Tables
A pipeline with a single flow
Create and schedule CDC flow with bulk load
CDC flow streams CDC events into the designated stage in the local or cloud storage in real-time and periodically (as often as every second) loads the data into the destination database in parallel with the stream.
Note: There is no need to create a separate Flow for the initial load. The first time it connects to a CDC-enabled source database, it reads a consistent snapshot of all of the included databases and tables. When that snapshot is complete, the Flow continuously reads the changes that were committed to the transaction log and generates the corresponding insert, update, and delete events.
Read more about CDC in Etlworks.
Step 1. Create a CDC Connection for the source.
Read how to create a CDC connection.
Step 2 (optional). Create connection for staging files in a cloud storage
If you are planning to stage files in AWS S3, Azure Storage, or Google Cloud Storage you need to create one of the following connections:
- Amazon S3 - for staging files in S3.
- Google Cloud Storage - for staging files in GC storage.
- Microsoft Azure Storage - for staging files in Azure Blob.
This step is not required if the flow stages files locally.
Step 3. Create a database connection for the destination.
It is recommended to enable the auto-commit for the destination connection.
Step 4. Create a connection for history and offset files.
Read how to create CDC Offset and History connection.
Etlworks CDC connectors store the history of DDL changes for the monitored database in the history file and the current position in the transaction log in the offset file.
Typical CDC extract flow starts by snapshotting the monitored tables (A) or starts from the oldest known position in the transaction (redo) log (B), then proceeds to stream changes in the source database (C). If the Flow is stopped and restarted, it resumes from the last recorded position in the transaction log. The connection created in this step can be used to reset the CDC pipeline and restart the process from scratch.
The connection, by default, points to the directory {app.data}/debezium_data.
Step 5. Create CDC flow.
In Flows click Add flow. Type incdc in Select Flow type. Select Stream CDC abd bulk load CDC events into database.
Step 6. Add source-to-destination transformation
Left to right:
- select the CDC connection
- select tables to monitor in FROM
- select the destination database connection
- select or enter the destination table name in TO. When streaming data from multiple source tables set the destination table using a wildcard template.
Step 7 (optional). Set connection for staging files in the cloud storage.
If you are planning to stage files in AWS S3, Azure Storage, or Google Cloud Storage select Connections tab, select connection created in step 2 and select CSV format.
This step is not required if the flow stages files locally.
Step 8. Configure load parameters
Click the MAPPING button, select the Parameters tab.
If needed modify the following Load parameters:
- Load data into database every (ms): by default, the flow loads data into database every 5 minutes (300000 milliseconds). The load runs in parallel with the CDC stream, which never stops. Decrease this parameter to load data into database more often or increase it to reduce the number of consumed database credits.
- Wait (ms) to let running load finish when CDC stream stops: By default, the flow loads data into database every 5 minutes. The CDC stream and load are running in parallel, so when streaming stops, the flow executes the load last more time to finish loading the remaining data in the queue. It is possible that the load flow is still running when the stream stops. Use this parameter to configure how long the flow should wait before executing the load last time. Clear this parameter to disable the wait. In this case, if the load task is still running, the flow will finish without executing the load one last time. The flow will load the remaining data in the queue on the next run.
- Load into staging table: by default, the Flow will attempt to create and load data into the temporary table. Not all databases support the bulk load into the temp table. When this parameter is enabled (it is disabled by default) the flow will create the staging table instead of temporary. It will automatically drop the staging table on the exit.
-
Bulk Load SQL: the Bulk Load SQL is used to load files in the cloud or file storage into the staging or temporary tables in the destination database. This is a required parameter. The following {tokens} can be used in the Bulk Load SQL:
- {TABLE} - the table to load data into,
- {FILE_TO_LOAD} - the name of the file to load with path and extension,
- {FILE} - the name of the file to load without path,
- {FULL_FILE_NAME} - same as {FILE_TO_LOAD},
- {FILE_NO_EXT} - the name of the file to load without path and extension, {EXT} - the extension of the file to load without '.'
- Example for Azure SQL Server:
-
BULK INSERT {TABLE}
FROM '{FILE_TO_LOAD}'
WITH (
DATA_SOURCE = 'BulkLoadDataSource',
FIELDTERMINATOR = ',',
FORMAT='CSV',
FIRSTROW = 2,
MAXERRORS = 10
) - Action: the action can be MERGE (default) or INSERT. If the action is set to MERGE the flow will INSERT records that do not exist in the destination table, UPDATE existing records, and DELETE records that were deleted in the source table.
- Lookup Fields: MERGEaction requires a list of columns that uniquely identify the record. By default, the flow will attempt to predict the Lookup Fields by checking unique indexes in the source and destination tables, but if there is no unique index in either table it is not guaranteed that the prediction will be 100% accurate. Use this parameter to define the Lookup Fields in the following format: fully.qualified.table1=field1,field2;fully.qualified.table2=field1,field2.
Note: The other parameters are similar or the same as for the flow type Bulk load files into database.
Step 9. Schedule CDC flow.
We recommend using a continuous run Schedule type. The idea is that the Flow runs until it is stopped manually, there is an error, or (if configured) there are no more new CDC events for an extended period of time. It restarts automatically after a configurable number of seconds.
Monitor running CDC flow
Read how to monitor running CDC flow.
A pipeline with independent extract and load flows
CDC extract flow extracts data from a CDC-enabled database and creates CSV files with CDC events in the configured location. These files are loaded into the target database by Load Flow.
Step-by-Step Guide
Step 1. Create and schedule CDC Extract flow
This flow is used to extract CDC events from the source database in read time and create files with events.
Read how to create CDC extract flow.
Step 2. Create and schedule Bulk Load flow
This Flow is used to bulk load files created by the CDC extract flow into the destination database.
Read how to create bulk load flow.
Step 3. Schedule Bulk Load Flow.
Schedule flow to run as often as needed. These are the options:
- Run flow periodically (as often as once a minute)
- Run flow continuously (as often as once a second).