About this page
This is the reference guide for the configuration knobs on the CDC connection and on destinations that consume CDC events (in particular the MongoDB streaming destination). For each setting we cover four things in the same order: what it does, the default, when to change it, and the tradeoffs and interactions — especially with other settings whose effect depends on this one.
The CDC engine has a lot of options because it has to support many sources, many destinations, and many operational shapes. Most of those options have safe defaults. A handful do not: changing them in the wrong direction is the most common source of CDC support cases. This page lists those knobs first and explains exactly what changes downstream when you flip them.
Etlworks also runs an automated Flow Findings inspector against every flow. When the inspector flags a CDC configuration, the finding text is intentionally short; this page is where the full explanation lives. If you arrived here from a finding, jump to the matching section using the links in the table below.
Quick reference: what each finding means
| Inspector finding | Severity | Jump to |
|---|---|---|
| FROM is empty, set to *, or uses a regex/wildcard | Major (Structure) | FROM — list of tables to capture |
| Add Event Type and Unique Sequence is disabled | Minor (Structure) | Add Event Type and Unique Sequence |
| …and the MongoDB destination performs CDC deletes | Critical (Structure) | Add Event Type and Unique Sequence |
| MongoDB destination has log.delete.operations or log.write.operations | Major (Performance) | Other Parameters: log.delete.operations and log.write.operations |
| Preprocessor script is configured on the CDC source | Minor (Performance) | Preprocessor script |
| Max Rows in File is below 1,000 or above 1,000,000 | Major (Performance) | Max Rows in File and Max File Size in Bytes |
| Max File Size in Bytes is below 1 MB or above 1 GB | Major (Performance) | Max Rows in File and Max File Size in Bytes |
| Location of CDC files is set | Major (Structure) | Location of CDC files |
| Offset File Name or DDL History File Name is set | Minor (Structure) | Offset File Name and DDL History File Name |
| …and the path is not under {app.data}/debezium_data | Major (Structure) | Offset File Name and DDL History File Name |
| Signal Data Collection is configured | Info (Structure) | Signal Data Collection |
| Soft deletes are configured in Extra columns | Info (Structure) | Extra columns (including soft deletes) |
| …and the MongoDB destination performs CDC deletes | Critical (Structure) | Extra columns (including soft deletes) |
| One of the transaction-metadata flags is enabled | Minor (Structure) | Transaction metadata |
| Snapshot Max Threads > 1 on a MongoDB CDC source | Major (Structure) | Snapshot Max Threads |
| Snapshot Max Threads > 1 on a non-MongoDB CDC source | Minor (Performance) | Snapshot Max Threads |
| SQL Server “Query all tables for changes” is enabled | Minor (Performance) | SQL Server: Query all tables for changes |
| Number of retries / Retry minutes / Stop after minutes is set | Minor (Structure) | Stream lifecycle: how to stop CDC Stream |
| JDBC Fetch Size is greater than 100,000 | Major (Performance) | JDBC Fetch Size |
Capture scope: what the source emits
FROM — list of tables to capture
What it does. The FROM field of a CDC flow tells the connector which tables (or collections, for MongoDB) to monitor. It accepts an explicit comma-separated list of fully qualified names, a single name, the wildcard *, or a regex pattern.
Default. Empty — you must configure FROM before the flow can run.
When to change. Always configure FROM as an explicit, comma-separated list of fully qualified table names, for example schema.orders, schema.customers, schema.line_items. Add or remove entries when you want to change the capture set.
Tradeoffs and interactions. Wildcard or regex FROM patterns look convenient and work fine during normal operation, but they have two consequences that often surface only after the flow has been running for a while:
- Ad-hoc snapshots cannot target a specific table. The ad-hoc snapshot mechanism (used to bring new tables into the pipeline without re-snapshotting everything, and to reload an existing table) operates on a list of explicit tables. If FROM is a wildcard or regex, the connector does not know in advance which tables are in scope, so it cannot trigger an ad-hoc snapshot for a single one. See Snapshot Management → Ad-hoc Snapshot.
- The capture set can change silently between restarts. If a new table that matches the pattern is created in the source database (or an existing table is renamed), the next time the connector restarts or a disaster recovery event triggers a re-sync, the connector recomputes the capture set against the live catalog. The recomputed set may differ from what was captured before, and the connector may decide to perform a full re-snapshot on the new members — even though, from the operator's point of view, nothing was changed.
The fix is to keep FROM explicit. When you need to add a table, edit FROM and use an ad-hoc snapshot to bring the new table in. The Etlworks Flow Findings inspector flags any non-explicit FROM on a CDC source as a Major structural issue.
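As a rough illustration of the explicit-versus-pattern distinction, the following Python sketch classifies a FROM value. It is not the inspector's actual logic: the function name `classify_from` and the set of characters treated as pattern syntax are assumptions made for this example, and a source that allows characters such as `$` in table names would need a different set.

```python
import re

# Characters that suggest a wildcard or regex rather than a plain table name.
# This character set is an assumption made for illustration only.
_PATTERN_CHARS = re.compile(r"[*?\[\]()|+^$\\{}]")


def classify_from(from_value):
    """Return 'empty', 'pattern', 'unqualified', or 'explicit' for a FROM value."""
    entries = [e.strip() for e in from_value.split(",") if e.strip()]
    if not entries:
        return "empty"
    if any(_PATTERN_CHARS.search(e) for e in entries):
        return "pattern"
    # Treat a name as fully qualified when it has at least one dot-separated
    # qualifier, for example schema.table.
    if any("." not in e for e in entries):
        return "unqualified"
    return "explicit"


print(classify_from("schema.orders, schema.customers, schema.line_items"))
print(classify_from("schema.ord.*"))
```

Only the `explicit` result corresponds to the recommended configuration. The other three results are the cases this section advises against.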
Add Event Type and Unique Sequence
What it does. When enabled, the CDC connector adds two columns to every event it emits:
- debezium_cdc_op — a single character indicating the kind of event: c (create / insert), u (update), or d (delete).
- debezium_cdc_timestamp — a unique, monotonically increasing LONG used as a sequence number for that event.
Default. Enabled.
When to change. Almost never. If a downstream system genuinely cannot accept either of these two extra columns and you are willing to give up the behaviors below, you can disable it.
Tradeoffs and interactions. Several downstream behaviors depend on these columns:
- Delete propagation to MongoDB. The MongoDB streaming destination's Perform delete on matching MongoDB document for CDC 'delete' operation option works by reading debezium_cdc_op and applying a delete when the value is d. With Add Event Type disabled there is no debezium_cdc_op column, so the destination cannot distinguish deletes from inserts and updates, and delete events are silently dropped. This combination is flagged by the inspector as Critical.
- Deterministic event ordering. debezium_cdc_timestamp is the tie-breaker when multiple events for the same key arrive in the same batch. Without it, downstream MERGE logic that orders by source-database timestamps has no fallback when those timestamps collide.
- MERGE into cloud data warehouses. The documented warehouse loading recipes assume both columns are present and use debezium_cdc_op in the MERGE clause to map each event to the right destination action.
If you find yourself reaching for the disable switch, look for a way to suppress these columns at the destination instead (for example, by not selecting them in the field mapping). Leave the source enabled.
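To make the role of the two columns concrete, here is a minimal Python sketch of a downstream consumer that orders a batch by debezium_cdc_timestamp and maps debezium_cdc_op to an action. It models the idea only. The function name `apply_events`, the `id` key field, and the in-memory dictionary standing in for a destination table are all hypothetical.

```python
def apply_events(target, events, key_field="id"):
    """Apply CDC events to an in-memory 'table' (a dict keyed by key_field)."""
    cdc_columns = ("debezium_cdc_op", "debezium_cdc_timestamp")
    # Sort by the unique sequence so events for the same key are applied in
    # the order they were emitted, even if the batch arrived shuffled.
    for event in sorted(events, key=lambda e: e["debezium_cdc_timestamp"]):
        op = event["debezium_cdc_op"]
        key = event[key_field]
        if op == "d":
            target.pop(key, None)
        elif op in ("c", "u"):
            # Store the row without the two CDC columns.
            target[key] = {k: v for k, v in event.items() if k not in cdc_columns}
        else:
            raise ValueError(f"unexpected debezium_cdc_op: {op!r}")
    return target


batch = [
    {"id": 1, "status": "shipped", "debezium_cdc_op": "u", "debezium_cdc_timestamp": 102},
    {"id": 1, "status": "new", "debezium_cdc_op": "c", "debezium_cdc_timestamp": 101},
    {"id": 2, "status": "new", "debezium_cdc_op": "c", "debezium_cdc_timestamp": 103},
    {"id": 2, "status": "new", "debezium_cdc_op": "d", "debezium_cdc_timestamp": 104},
]
result = apply_events({}, batch)
print(result)  # {1: {'id': 1, 'status': 'shipped'}}
```

Without debezium_cdc_op the `d` branch can never be taken, and without debezium_cdc_timestamp the sort key is gone. Those are the two failure modes described in the list above.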
Snapshot behavior
Snapshot mode
What it does. Tells the connector what to do the first time it starts against a given source, and what to do when it cannot find a usable offset position on restart. Common values:
- ad-hoc initial (default) — perform a full initial snapshot on first start, then stream from the binlog / WAL / changelog. Subsequent ad-hoc snapshots can be triggered for specific tables.
- initial — full initial snapshot on first start, then stream. Ad-hoc snapshots disabled.
- schema_only — capture schema only on first start, then stream from the current position. Existing rows are not loaded.
- never — do not snapshot; always stream from the current position. Existing rows are not loaded and there is no fallback if the offset is lost.
- initial_only — perform the initial snapshot and then stop. Used for one-shot exports.
- when_needed — perform a snapshot only if the connector cannot find a valid offset.
Default. ad-hoc initial.
When to change. Switch to schema_only when the source is too large to snapshot and you only need new changes. Switch to never only when you have an external mechanism for the initial load and you are sure the binlog / WAL retention exceeds any restart window.
Tradeoffs and interactions. Per-mode details, including the exact behavior of ad-hoc initial and the rules around incremental snapshots, are covered in Snapshot Management. Note that the never and schema_only modes will simply skip the initial load — if the destination starts empty, it will not be backfilled.
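The per-mode behavior on first start can be summarized as a small lookup table. The Python sketch below transcribes the list above. The dictionary layout and the helper name `leaves_destination_empty` are illustrative, and when_needed is left out because its behavior depends on whether a valid offset exists.

```python
# First-start behavior per snapshot mode, transcribed from the list above.
SNAPSHOT_MODES = {
    "ad-hoc initial": {"loads_existing_rows": True, "streams_after": True},
    "initial": {"loads_existing_rows": True, "streams_after": True},
    "schema_only": {"loads_existing_rows": False, "streams_after": True},
    "never": {"loads_existing_rows": False, "streams_after": True},
    "initial_only": {"loads_existing_rows": True, "streams_after": False},
}


def leaves_destination_empty(mode):
    """True when the mode skips the initial load, so an empty destination is not backfilled."""
    return not SNAPSHOT_MODES[mode]["loads_existing_rows"]


for mode in SNAPSHOT_MODES:
    print(mode, "-> skips initial load:", leaves_destination_empty(mode))
```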
Snapshot Max Threads
What it does. Sets how many threads the connector uses during the initial snapshot phase. Each thread snapshots one table concurrently with the others.
Default. 1 (sequential).
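When to change. On a relational source with many tables and a database that has spare capacity, raising this to 2–4 can shorten the snapshot by a factor approaching the thread count, provided the tables are of similar size (each thread handles one table at a time, so a single very large table still sets the floor). On smaller or more heavily loaded source databases, leave it at 1.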
Tradeoffs and interactions.
- On MongoDB sources, parallel snapshot is a beta feature and is not recommended for production use. It can corrupt the snapshot or fail with hard-to-diagnose errors under load. Always use a value of 1 on MongoDB CDC sources. The inspector flags Snapshot Max Threads > 1 on a MongoDB source as a Major structural issue.
- On every other source, raising the value multiplies the number of concurrent connections opened against the source database, multiplies the JVM heap used during snapshot (roughly proportional to thread count when fetch sizes are large — see JDBC Fetch Size), and slightly increases binlog / WAL retention pressure because each thread holds its own consistent read for the duration of its snapshot.
- The setting only affects the snapshot phase. Once the connector switches to streaming, it operates with a single capture thread regardless of this value.
Signal Data Collection
What it does. Names a table (in a database connection) or file (in a file-system connection) that the connector watches for signal records. A signal record can ask the connector to add new tables to the pipeline, reload existing tables, or perform other operations without restarting the flow.
Default. Empty — signals disabled.
When to change. Set Signal Data Collection only when you intend to use one of the signal-driven mechanisms. The exact behavior depends on whether you also set a Signal connection in the flow's Connections tab:
- Signal connection set and Signal Data Collection set: the connector watches the signal table or file in the named connection for ad-hoc snapshot requests. It polls on a fixed interval, so the recurring cost is low but not zero.
- Signal connection not set and Signal Data Collection set: the connector enables an incremental snapshot, which scans the source database itself for changes that match the signal collection. Incremental snapshots require read/write access to the source database (the connector writes signal markers to track progress), and on PostgreSQL specifically they can cause WAL segments to accumulate until the snapshot completes. On busy production sources this can fill the WAL volume.
Tradeoffs and interactions. Confirm which of the two mechanisms you actually want. If you only need ad-hoc snapshots, set the Signal connection explicitly so the connector takes the safe polling path. For incremental snapshots, read Snapshot Management → Incremental Snapshots first — the warnings there apply.
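The decision between the two mechanisms comes down to two fields, which the following Python sketch spells out. The function name `signal_mechanism` and the returned strings are hypothetical. The sketch only restates the rules in this section and does not inspect a real flow.

```python
def signal_mechanism(signal_data_collection, signal_connection):
    """Describe which signal mechanism the two settings select."""
    has_collection = bool(signal_data_collection.strip())
    has_connection = bool(signal_connection.strip())
    if not has_collection:
        # Default: Signal Data Collection is empty, so signals are disabled.
        return "signals disabled"
    if has_connection:
        return "ad-hoc snapshot (polls the signal table or file)"
    return "incremental snapshot (reads and writes the source database)"


print(signal_mechanism("", ""))
print(signal_mechanism("etl.cdc_signals", "signal_db"))
print(signal_mechanism("etl.cdc_signals", ""))
```

The table name `etl.cdc_signals` and the connection name `signal_db` are placeholders.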
Performance and I/O
JDBC Fetch Size
What it does. Tells the JDBC driver how many rows to fetch per round trip when reading from the source database during the initial snapshot. The connector hands this value to the underlying driver as the result-set fetch size.
Default. 10,000.
When to change. Lower the value if the source database is slow and you want to reduce the wait between batches. Raise it modestly if the source is fast and you want to reduce JDBC round-trip overhead.
Tradeoffs and interactions. The fetch size directly drives JVM memory consumption during snapshot — each thread holds approximately fetch size × row width bytes in the result-set buffer. On wide tables (many columns, or columns containing large text or binary values) a fetch size above 100,000 can exhaust the integrator heap long before any data is forwarded. The Flow Findings inspector flags fetch sizes > 100,000 as a Major performance issue. Keep the value at or below 100,000 in all cases; the default 10,000 is fine for most workloads.
This setting only affects the snapshot phase. Streaming uses the database's native change feed, not JDBC reads.
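A back-of-the-envelope estimate helps when choosing a value. The Python sketch below multiplies fetch size by an average row width and by Snapshot Max Threads, following the approximation given above. The row widths are made-up inputs, and real heap usage also includes driver and object overhead that this ignores.

```python
def snapshot_buffer_bytes(fetch_size, avg_row_bytes, threads=1):
    """Approximate result-set buffer size: fetch size x row width, per snapshot thread."""
    return fetch_size * avg_row_bytes * threads


# Default fetch size, 2 KB rows, one thread.
default_case = snapshot_buffer_bytes(10_000, 2_000)
# Oversized fetch size, 10 KB rows, four snapshot threads.
wide_case = snapshot_buffer_bytes(200_000, 10_000, threads=4)

print(f"default: {default_case / 1e6:.0f} MB")  # default: 20 MB
print(f"wide:    {wide_case / 1e9:.0f} GB")     # wide:    8 GB
```

The second case shows how a large fetch size, wide rows, and parallel snapshot threads compound each other.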
Max Rows in File and Max File Size in Bytes
What they do. The CDC connector buffers events to disk (under {app.data}/debezium_data or the destination connection's storage) and rotates the buffer file when either threshold is reached. The downstream Load flow then processes one rotated file at a time.
Defaults. Max Rows in File: 100,000. Max File Size in Bytes: unset (rotation by row count only).
When to change. Lower the row count when downstream load throughput is constrained and you want to shrink the per-file load time. Raise it when downstream load is fast and you want to amortize file-open / load setup cost over more rows.
Tradeoffs and interactions. Both thresholds have a useful range. Outside that range the flow either spends most of its time rotating tiny files or chokes downstream on a single huge file:
- Max Rows in File below 1,000 — the connector spends a disproportionate share of its time on file open, write, close, and notify cycles. CPU time goes up, throughput goes down.
- Max Rows in File above 1,000,000 — individual files become too large to load efficiently. The destination loader may run out of memory or hold a transaction open for far too long.
- Max File Size in Bytes below 1 MB — same problem as the low-row case, just expressed in bytes.
- Max File Size in Bytes above 1 GB — same problem as the high-row case, with a higher likelihood of OOM downstream.
The Flow Findings inspector flags out-of-range values for either setting as Major performance issues. The defaults (100,000 rows and no byte limit, so rotation is driven by row count alone) are a good starting point for most workloads.
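The rotation rule and the recommended ranges can be sketched in a few lines of Python. This is an illustrative model and not the connector's code: the function names are hypothetical, and the byte thresholds assume decimal units (1 MB as 1,000,000 bytes and 1 GB as 1,000,000,000 bytes), which this page does not specify.

```python
MIN_ROWS, MAX_ROWS = 1_000, 1_000_000
MIN_BYTES, MAX_BYTES = 1_000_000, 1_000_000_000  # assumes decimal MB and GB


def should_rotate(rows_in_file, bytes_in_file, max_rows=100_000, max_bytes=None):
    """Rotate the buffer file when either configured threshold is reached."""
    if max_rows is not None and rows_in_file >= max_rows:
        return True
    if max_bytes is not None and bytes_in_file >= max_bytes:
        return True
    return False


def rotation_findings(max_rows=100_000, max_bytes=None):
    """List the settings that fall outside the recommended ranges."""
    findings = []
    if max_rows is not None and not MIN_ROWS <= max_rows <= MAX_ROWS:
        findings.append("Max Rows in File out of range")
    if max_bytes is not None and not MIN_BYTES <= max_bytes <= MAX_BYTES:
        findings.append("Max File Size in Bytes out of range")
    return findings


print(should_rotate(99_999, 50_000_000))  # False: byte limit unset, row limit not reached
print(rotation_findings(max_rows=500, max_bytes=2_000_000_000))
```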
Storage: offset, history, and CDC files
Location of CDC files
What it does. Overrides the directory where the connector writes serialized CDC events on disk. Modern CDC flows pick this up from the destination connection in the flow's TO step.
Default. Empty — the connector uses {app.data}/debezium_data/events automatically.
When to change. Almost never. This field is a legacy from earlier CDC flow templates that did not have a dedicated destination connection.
Tradeoffs and interactions. When Location of CDC files is set, the connector writes to that path regardless of what the destination connection in TO says. That decouples the CDC pipeline from the destination flow and can cause the loader to look in one place while the connector writes to another. The Flow Findings inspector flags any non-empty Location of CDC files as a Major structural issue. Clear the field and let the TO connection drive event placement.
Offset File Name and DDL History File Name
What they do. Name the two files the connector uses to remember its position in the source change feed and the captured schema history. The connector reads them on startup to resume from the last committed position and to know how to interpret records emitted under earlier schema versions.
Defaults. Empty — the connector derives both file names from the CDC connection name and places them under {app.data}/debezium_data.
When to change. Override these fields only when the same tokenized CDC connection is shared across multiple flows (in which case each flow needs its own offset and history files) or when the connection name is going to change and you want to keep the file names stable.
Tradeoffs and interactions. Two distinct concerns apply:
- File name collisions. If two flows share a CDC connection and both write to the same default file names, each flow will overwrite the other's offset position. Result: every restart looks like a fresh start, and the same events stream repeatedly. Overriding the file name per flow is the fix.
- Backup coverage. Etlworks automatically backs up everything under {app.data}/debezium_data. Files placed outside that directory are not backed up. If they are lost — the volume is wiped, the file is corrupted, the host is replaced — the CDC stream cannot resume from the last position and a full re-snapshot is required. The Flow Findings inspector flags off-tree paths as a Major structural issue. Move the override path back under {app.data}/debezium_data.
The recommended pattern is to override only the file name, leaving the directory at the default. Example: set Offset File Name to orders_pipeline.offset, not /var/etlworks/state/orders_pipeline.offset.
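A simple path check captures the backup rule. In the Python sketch below, the function name `is_backed_up` and the `/opt/etlworks/data` value standing in for {app.data} are hypothetical. The sketch also assumes that a relative value is resolved against the default directory, which matches the recommended pattern above but is not confirmed for nested relative paths.

```python
from pathlib import PurePosixPath


def is_backed_up(configured, app_data="/opt/etlworks/data"):
    """True when the configured file resolves under {app.data}/debezium_data."""
    base = PurePosixPath(app_data) / "debezium_data"
    path = PurePosixPath(configured)
    if ".." in path.parts:
        # Parent-directory hops could escape the backed-up tree, so reject them.
        return False
    if not path.is_absolute():
        # A bare file name keeps the default directory (assumption, see above).
        return True
    return base in path.parents


print(is_backed_up("orders_pipeline.offset"))                      # True
print(is_backed_up("/var/etlworks/state/orders_pipeline.offset"))  # False
```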
Event transformation
Preprocessor script
What it does. A JavaScript snippet that runs for every CDC event before it is serialized to disk. The script has access to the event (as a Jackson JsonNode), the CDC key, the event's source metadata, and the transaction id. It can mutate the event in place, return a Boolean to skip the event, return a String to override the CDC key, or return a TypedKeyValue to do both.
Default. Empty — no preprocessing.
When to change. When a per-event transformation cannot be expressed as an Extra column. The most common legitimate cases are conditionally skipping events based on a complex predicate, or rewriting CDC keys in non-trivial ways.
Tradeoffs and interactions. The preprocessor runs once per event in the JavaScript engine. On a high-throughput stream this per-event cost can dominate CPU consumption. For the common cases (adding fixed columns, computing derived values, attaching the source database or table name, generating composite keys), the Extra columns field with built-in cdc_* functions does the same work in native code at a fraction of the cost. Reach for the preprocessor only after you have confirmed that Extra columns cannot express what you need.
Extra columns (including soft deletes)
What it does. A comma-separated list of additional columns to append to every CDC event. Each entry is either a plain name (the value comes from a global variable with the same name, or is empty) or a name with a built-in function: column_name=function_name. Built-in functions include:
| Function | Returns |
|---|---|
| cdc_database | The source database name |
| cdc_table | The source table name |
| cdc_op | The event type: c (insert), u (update), or d (delete) |
| cdc_timestamp | The event timestamp |
| cdc_event_seq | A unique sequential id of the event |
| cdc_key | The composite CDC key (database + table) |
| cdc_boolean_soft_delete | true if the event is a delete, false otherwise |
| cdc_long_soft_delete | Delete timestamp as milliseconds since epoch, or 0 for non-deletes |
| cdc_timestamp_soft_delete | Delete timestamp as a formatted string, or null for non-deletes |
Default. Empty.
When to change. Add columns that downstream loaders need: database and table names for multi-source ingestion, sequence ids for ordering, and so on.
Tradeoffs and interactions: soft deletes. The three cdc_*_soft_delete functions implement a soft-delete pattern by rewriting delete events as updates that mark the row with the soft-delete column. This means the destination never receives an actual delete event; rows are kept in the destination forever and marked as deleted. Two consequences:
- This is intentional in some pipelines (audit warehouses, regulated industries that retain deleted records) but it is a frequent source of confusion when users expect physical deletes downstream. Confirm the behavior is what you want.
- Soft deletes and the MongoDB destination's Perform delete option are mutually exclusive. If both are configured, the MongoDB destination receives no delete events to apply, and no documents are ever removed. The Flow Findings inspector flags this combination as a Critical structural issue. Either remove the soft-delete column from Extra columns or disable Perform delete on the MongoDB destination.
For the full pattern and examples see Tips and Tricks for CDC Flows → Implementing Soft Deletes.
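To illustrate what the destination sees when soft deletes are configured, the Python sketch below models the three soft-delete values and the rewrite of a delete into an update. It is a model of the described behavior and not Etlworks code: the event fields `op` and `ts_ms`, the column names, and the timestamp format are all assumptions.

```python
from datetime import datetime, timezone


def soft_delete_columns(op, event_ts_ms):
    """Model the values of the three cdc_*_soft_delete functions for one event."""
    is_delete = op == "d"
    formatted = None
    if is_delete:
        moment = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc)
        formatted = moment.strftime("%Y-%m-%d %H:%M:%S")  # format is an assumption
    return {
        "is_deleted": is_delete,                            # cdc_boolean_soft_delete
        "deleted_at_ms": event_ts_ms if is_delete else 0,   # cdc_long_soft_delete
        "deleted_at": formatted,                            # cdc_timestamp_soft_delete
    }


def rewrite_event(event):
    """Return a copy of the event with soft-delete columns; deletes become updates."""
    out = dict(event)
    out.update(soft_delete_columns(event["op"], event["ts_ms"]))
    if event["op"] == "d":
        out["op"] = "u"
    return out


deleted = rewrite_event({"id": 7, "op": "d", "ts_ms": 1_700_000_000_000})
print(deleted["op"], deleted["is_deleted"], deleted["deleted_at"])
```

After the rewrite no event carries `op == "d"`, which is why a destination configured to act on deletes has nothing to apply.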
Transaction metadata
The CDC connection has three checkboxes that emit transaction-boundary information into the event stream:
- Provide Transaction Metadata — attaches the source transaction id to every event.
- Log Transaction Metadata — writes a separate log record for each transaction, indicating the events it contained.
- Log Transaction Start/End — writes a record at the start and end of each transaction.
Default. All three disabled.
When to change. Enable only when the destination flow has been explicitly built to consume the markers — for example, when downstream code needs to apply a set of events as a single atomic batch, or to reconstruct source transactions in an audit log.
Tradeoffs and interactions. Each flag adds per-event or per-transaction overhead and a corresponding amount of additional disk traffic. Without matching downstream plumbing the markers are written, paid for, and ignored. The Flow Findings inspector flags any of the three being enabled as a Minor structural issue. If you are not sure whether downstream consumes them, the safe default is to disable all three.
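As an example of the downstream plumbing these flags presuppose, the Python sketch below groups events into per-transaction batches so each batch can be applied atomically. The field name `tx_id` is hypothetical, and the sketch assumes that events from one source transaction arrive contiguously in the stream.

```python
from itertools import groupby


def batches_by_transaction(events):
    """Yield (transaction id, events) for each run of consecutive events sharing an id."""
    for tx_id, group in groupby(events, key=lambda e: e["tx_id"]):
        yield tx_id, list(group)


events = [
    {"tx_id": "a", "id": 1},
    {"tx_id": "a", "id": 2},
    {"tx_id": "b", "id": 3},
]
batches = [(tx, [e["id"] for e in evs]) for tx, evs in batches_by_transaction(events)]
print(batches)  # [('a', [1, 2]), ('b', [3])]
```

If nothing downstream performs a step like this, the transaction id attached to each event is not used.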
Stream lifecycle: how to stop CDC Stream
Three CDC connection fields control when the CDC stream stops running:
- Number of retries before giving up — after this many consecutive errors, stop the stream.
- Retry N minutes before giving up — after errors persist for this long, stop the stream.
- Always stop after N minutes — stop the stream after this many minutes regardless of state.
Defaults. All empty — the stream runs continuously and self-heals through Debezium's internal retry behavior.
When to change. Set any of these only when you have a specific operational reason to stop the stream periodically: for example, to release resources in a quiet window, or to force a restart on a schedule. Even then, the stop is normally driven by the flow's own schedule rather than these fields.
Tradeoffs and interactions. With any of these set, the CDC stream is not continuous — it stops on the configured condition and depends on the flow's schedule to bring it back. The most common symptom is "the flow keeps stopping by itself" and the most common root cause is a value set in one of these fields. The Flow Findings inspector flags non-empty values as a Minor structural issue. Clear all three to make the stream continuous.
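The three stop conditions can be modeled as independent checks, any one of which ends the stream. The Python sketch below is a simplified model with hypothetical names. Whether the real thresholds are inclusive is not stated on this page, so the use of `>=` is an assumption.

```python
def should_stop(consecutive_errors, minutes_in_error, minutes_running,
                max_retries=None, retry_minutes=None, stop_after_minutes=None):
    """True when any configured stop condition is met; None means the field is empty."""
    if max_retries is not None and consecutive_errors >= max_retries:
        return True
    if retry_minutes is not None and minutes_in_error >= retry_minutes:
        return True
    if stop_after_minutes is not None and minutes_running >= stop_after_minutes:
        return True
    return False


# All three fields empty: the stream never stops because of these settings.
print(should_stop(100, 500, 10_000))                   # False
# "Always stop after N minutes" set to 60: a healthy stream still stops.
print(should_stop(0, 0, 60, stop_after_minutes=60))    # True
```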
Source-specific settings
SQL Server: Query all tables for changes
What it does. When enabled, the SQL Server CDC connector queries every captured table on each polling cycle, regardless of whether SQL Server has recorded any changes for that table. When disabled (the default), the connector consults SQL Server's change tracking and only queries tables that have changes since the last cycle.
Default. Disabled.
When to change. Enable only when you have confirmed, in your specific environment, that SQL Server is failing to report changes through change tracking and the connector is missing events. This situation is rare.
Tradeoffs and interactions. Enabling this option multiplies the per-cycle query load on SQL Server by the number of captured tables. On a workload where most tables receive no updates between cycles (the common case), nearly all of that work is wasted. The Flow Findings inspector flags this option as a Minor performance issue.
Destination-side settings (MongoDB)
Perform delete on matching MongoDB document for CDC 'delete' operation
What it does. On the MongoDB streaming destination, this option tells the destination to delete the matching document when it receives a CDC event of type d (delete). When disabled, delete events are dropped at the destination and the matching documents remain in MongoDB.
Default. Disabled.
When to change. Enable when you want CDC deletes to propagate as physical deletes in MongoDB.
Tradeoffs and interactions. This option only works when the CDC event stream carries the debezium_cdc_op column. That requires:
- Add Event Type and Unique Sequence to be enabled on the source CDC connection. If it is disabled, no events have an op marker and no deletes are applied.
- Soft deletes not to be configured in the source's Extra columns. Soft deletes rewrite deletes as updates, so the destination never sees a CDC delete and has nothing to apply.
The Flow Findings inspector flags both incompatibilities as Critical structural issues. Delete propagation works in exactly one configuration: the source has Add Event Type and Unique Sequence enabled, the source has no soft-delete column in Extra columns, and the destination has Perform delete enabled.
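The compatibility rules above reduce to a short check. The Python sketch below is illustrative only: the function name, parameter names, and message strings are hypothetical and do not reproduce the inspector's output.

```python
def perform_delete_findings(add_event_type_enabled, soft_delete_configured,
                            perform_delete_enabled):
    """Return the Critical conflicts for a flow whose MongoDB destination applies deletes."""
    findings = []
    if not perform_delete_enabled:
        # Nothing to check: the destination is not trying to apply CDC deletes.
        return findings
    if not add_event_type_enabled:
        findings.append("Critical: Add Event Type and Unique Sequence is disabled")
    if soft_delete_configured:
        findings.append("Critical: soft deletes are configured in Extra columns")
    return findings


# The one working configuration produces no findings.
print(perform_delete_findings(True, False, True))  # []
print(perform_delete_findings(False, True, True))
```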
Other Parameters: log.delete.operations and log.write.operations
What they do. Two MongoDB destination flags, set through Other Parameters, that write a log entry for every document deleted or written. They are intended for debugging visibility into what the destination is doing.
Default. Both unset (no per-document logging).
When to change. Enable temporarily to confirm that the destination is receiving the expected events or applying the expected operations. Disable as soon as the debugging session is over.
Tradeoffs and interactions. When enabled, these flags generate a log record for every document the destination touches. On a CDC pipeline that processes thousands of events per second, this floods the Etlworks log with destination chatter, slows down the destination, and can fill the log volume in hours. The Flow Findings inspector reports either flag set to true as a Major performance issue. Set them to false (or remove them from Other Parameters) once debugging is finished.
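A quick calculation shows how "hours" can arise. The Python sketch below assumes one log record per event. The event rate, record size, and free space are made-up inputs rather than measurements.

```python
def hours_until_log_volume_full(events_per_second, bytes_per_log_record, free_bytes):
    """Hours until the log volume fills, assuming one log record per event."""
    bytes_per_second = events_per_second * bytes_per_log_record
    return free_bytes / bytes_per_second / 3600


# 5,000 events/s, 200-byte log records, 20 GB of free log space.
hours = hours_until_log_volume_full(5_000, 200, 20_000_000_000)
print(f"{hours:.1f} hours")  # 5.6 hours
```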
See also
- Change Data Capture (CDC) from transaction log — CDC section landing page.
- Snapshot Management — the full story on initial, ad-hoc, incremental, and partial snapshots.
- CDC Configuration and Monitoring — stop, reset, monitor, recovery, and offset backup.
- Tips and Tricks for CDC Flows — field-level transformations, soft deletes, the CDC key, and event handling.
- Database-Specific CDC Cases — per-source requirements and quirks (MySQL, PostgreSQL, SQL Server, Oracle, DB2, AS/400, MongoDB).