Overview
This article explains how Etlworks handles the initial data snapshot and subsequent data flow control in Change Data Capture (CDC) pipelines. It covers key topics such as parallel and ad-hoc snapshots, the process of adding or reloading specific tables, and advanced snapshot strategies like incremental and partial snapshots. It also introduces the concept of signal collections, which allow dynamic control of snapshots during Flow execution without requiring Flow restarts. Understanding these options allows for greater flexibility, fault tolerance, and performance optimization in long-running CDC Flows.
Full (Initial) Snapshot
Where to configure: CDC connection
-
Ensure Snapshot Mode is set to:
ad-hoc initial
initial
Note: Full snapshots are not resumable. If interrupted, the connector will perform a full snapshot on restart unless configured otherwise.
Parallel Snapshot
Where to configure: CDC connection
By default, Etlworks processes snapshots one table at a time. To improve performance, you can enable parallel snapshots by setting the Snapshot Max Threads property in the CDC connection to a value greater than 1.
This setting allows multiple tables to be snapshotted concurrently during both the initial and ad-hoc snapshots. We recommend setting this value based on available CPU cores (typically: cores × 2).
Ad-hoc Snapshot
Where to configure: CDC connection and CDC flow
Once the initial snapshot is complete, the CDC Flow switches to streaming mode. If needed, you can re-run snapshots of specific tables without resetting the entire pipeline.
To enable ad-hoc snapshots, configure the CDC connection with an appropriate Snapshot Mode (see below). Etlworks will temporarily pause streaming, perform a snapshot of the specified tables, and resume streaming.
Note: Ad-hoc snapshots are not resumable. If interrupted, the connector will perform a full snapshot on restart unless configured otherwise.
Add New Tables
Where to configure: CDC connection and CDC flow
To start capturing changes from new tables:
CDC connection:
-
Ensure Snapshot Mode is set to:
ad-hoc initial
ad-hoc
ad-hoc schema only
Enable Automatically trigger ad-hoc snapshot for new tables.
CDC flow:
Stop the Flow.
Add the new tables to the FROM field. IMPORTANT: Use fully qualified table names; regular expressions are not supported.
Restart the Flow.
The new tables will be snapshotted automatically on restart and added to the CDC stream.
Reload Existing Tables
Where to configure: CDC connection and CDC flow
You can re-snapshot specific tables at runtime using a signal collection. This eliminates the need to stop or restart the Flow.
CDC connection:
-
Set Snapshot Mode to:
ad-hoc initial
ad-hoc
ad-hoc schema only
-
Set Signal Data Collection to either:
Fully qualified table name (if using a signal table).
File name with extension (if using a file-based signal).
CDC flow:
Include the signal connection (database or file) in the Flow’s Connections tab.
IMPORTANT: Use fully qualified table names; regular expressions are not supported.
How it works:
The signal collection contains table names to re-snapshot.
When detected, the snapshot is performed and the signal is removed (rows deleted or file removed).
Restarting the Flow is not required.
Signal format:
One column.
One or more fully qualified table names, comma-separated.
Example:
test.inventory, test.payment, test.customerIncremental Snapshots
Where to configure: CDC connection
Incremental snapshots are resumable, allowing partial progress to be preserved after a failure. They are ideal when working with large tables or when uninterrupted streaming is required.
To configure:
1. Verify permissions: The CDC connection must use a database user with write access.
2. Create the signal table in the source database:
CREATE TABLE db.schema.dbz_signal (
id VARCHAR(64),
type VARCHAR(32),
data VARCHAR(2048)
);3. CDC connection:
- Set Signal Data Collection to the fully qualified name of the signal table.
- Do not set Snapshot Mode to any ad-hoc variant.
- Add the signal table to the list of included tables (in the CDC flow or CDC connection).
4. To trigger a snapshot, run:
INSERT INTO db.schema.dbz_signal
VALUES ('signal-1', 'execute-snapshot', '{"data-collections":
["inventory.dbo.orders", "inventory.dbo.item"]}');The snapshot runs immediately if the Flow is active, or on the next restart if it’s not.
Partial Snapshot
Where to configure: CDC connection
To load a subset of data (e.g., only the last 90 days), you can configure partial snapshots using WHERE clauses.
CDC connection:
1. Set snapshot.select.statement.overrides:
snapshot.select.statement.overrides=test.dbo.products,test.dbo.orders2. Define SELECT statements with WHERE clauses for each table:
snapshot.select.statement.overrides.test.dbo.products=SELECT * FROM test.dbo.products WHERE created_at > DATE_SUB(NOW(), INTERVAL 90 DAY)
snapshot.select.statement.overrides.test.dbo.orders=SELECT * FROM test.dbo.orders WHERE created_at > DATE_SUB(NOW(), INTERVAL 90 DAY)This configuration allows you to reload filtered data across all tables without running a full snapshot.