Change data capture (CDC) is an approach to data integration that is based on the identification, capture, and delivery of the changes made to the source database and stored in the database redo log (also called transaction log).
Etlworks supports native log-based change data capture for PostgreSQL, SQL Server, MySQL, Oracle, and MongoDB.
Read about other change replication techniques available in Etlworks Integrator.
Enable change data capture for the source database
Enabling CDC is different for each database. Please use the following tutorials:
Create change data capture flow
Create CDC flow when the target is any database, file or a web service
Step 1. Create a CDC connection for the source database.
Step 2. Create a destination connection.
Step 3. Create a new flow by selecting any of the CDC flows from the gallery.
Step 4. Add a new source-to-destination transformation where:
- the FROM connection is the CDC connection created in step 1.
- the FROM is a fully qualified source table name (test.dbo.patient) or a wildcard table name (test.dbo.*). If the wildcard table name is used the system will perform a CDC change replication for all tables in the source with a name that matches the wildcard pattern.
- When the source is a MongoDB connection enter or select the MongoDB collection name. The wildcard names are not supported for MongoDB.
- the TO is a destination table name. If you have configured a wildcard table name in the FROM, configure the destination name in the TO using either only asterisk character (
*) or any part of the destination name with the asterisk character (
public.*). The system will substitute the asterisk character with the actual source object name (excluding database and schema names). For example, if the actual source name is
test.dbo.patientand the destination is
public.*the system will load data into the
- the TO connection is the destination database connection created in step 2.
Step 5. Click the
MAPPING button and select the Parameters tab. Set Action to one of the following options:
- Record - the system will use the event type stored in the record: "c" for INSERT, "u" for UPDATE and "d" for DELETE. This option requires setting the Lookup fields to the comma-separated list of fields that uniquely identify the record in the target database. Optionally you can enable the Predict Lookup Fields option.
- Record with MERGE - works the same as Record except always executing the native MERGE (UPSERT) on the target database for the "c" and "u" events.
- Record with IfExists - works the same as Record except always checking if the record exists in the target database and then conditionally excusing INSERT or UPDATE. Use this option if the native MERGE is not supported by the target database.
- INSERT - in some cases, it makes sense to always insert unmodified CDC events into some sort of staging table. Then load events from the staging table(s) into the final table using SQL. When this option is enabled the following fields will be automatically added to the staging table:
- debezium_cdc_op - event type, "c" for create, "u" for update and "d" for delete.
- debezium_cdc_timestamp - the event timestamp encoded as EPOCH time in milliseconds.
- To enable the INSERT option:
- 1. Enable the Always generate INSERT for the source CDC connection:
- 2. Set the Action to INSERT.
Create CDC flow when the target is Snowflake
Create CDC flow when the target is Amazon Redshift
Restream from the beginning if Error
Typically, the CDC flow will perform a full load in the first run and will automatically switch to the CDC mode for all consecutive runs.
You can enable option Restream from the beginning if Error when configuring the CDC connection.
If an error occurred and this option is enabled the system will automatically switch to the snapshot mode and attempt to reload the whole table in the next run. Use it with caution.
Use Internal Queue for better reliability
When configuring the CDC connection it is possible to enable the internal queue which will be used to temporarily store all CDC events until they successfully loaded into the target.
To enable or disable the queue open the CDC connection and select one of the available options for the field Use Internal Queue:
- never - disable the queue.
- always - enable the queue for snapshot and incremental pull.
- snapshot only - enable the queue for the snapshot only.
- except snapshot - enable the queue for the incremental pull only.
This option is ignored if Restream from the beginning if Error is enabled.
The sequence diagram below demonstrates how CDC events are getting processed when the internal queue is enabled.