Overview
This article provides a collection of advanced configuration options and best practices for working with CDC (Change Data Capture) connections in Etlworks. These tips help you control how data is formatted, enriched, or transformed before being delivered to its final destination.
Unless otherwise noted, all settings discussed here apply to the CDC connection itself, not the flow. The recommendations in this guide are especially relevant when streaming CDC events to files, message queues, or cloud storage.
Handling Null and Empty Values
When exporting change data to files, how you represent null values can impact downstream loads. These settings help control that.
Replace null with empty string
If you want null values to be treated as empty strings:
-
In the CDC connection, enable: Convert null to empty.
This is useful when:
-
The destination does not accept the literal string null.
-
You want empty cells in CSV files instead of "null".
Replace null with a specific string
To write nulls as a custom value (e.g., N/A, missing, unknown):
-
Set Value for null in the CDC connection.
This option overrides all other null-handling settings. It ensures consistency across different destinations, especially in reporting or analytics.
Replace empty string with null
This setting applies only to CDC Flows that stream change events into message queues, NoSQL databases (such as MongoDB), or Web Services (APIs).
By default, Etlworks does not replace empty strings with nulls — empty string values remain unchanged in CDC events. This behavior ensures compatibility with most modern destinations and APIs.
If you need to convert empty strings to null values (primarily for backward compatibility with older Etlworks versions where this behavior was enabled by default), you can enable it manually in Other Parameters in the source CDC connection.
nullifempty=false
Recommendation: Keep the default setting unless a downstream system specifically requires nulls instead of empty strings.
Control quoting behavior for nulls
In some cases, especially when working with databases like Greenplum, quoted nulls (e.g., "null") can cause issues:
-
In the CDC connection, enable: Do not enclose null in double quotes.
Use this if your downstream loader treats "null" as a literal string rather than SQL null.
Boolean Conversion for Compatibility
Some databases or file readers prefer numeric booleans (1/0) instead of true/false.
To convert:
-
In the CDC connection, enable: Convert boolean true/false into numerical 1/0.
This helps with strict database schemas or legacy systems.
End-of-Line Characters in Text Fields
Multi-line strings may cause issues in CSV files if not handled carefully.
By default, Etlworks removes newline characters. To preserve them:
-
In the CDC connection, disable: Remove EOL characters.
Use this only if the destination supports multi-line fields or you’re writing JSON.
File Encoding and Compatibility
When working with files, make sure the encoding is consistent across the pipeline.
Set encoding explicitly
-
Choose encoding in the CDC connection (e.g., UTF-8, ISO-8859-1).
-
Use the same encoding in the CSV or JSON Format used later in the pipeline.
This is especially important when:
-
Loading files into systems that do not default to UTF-8.
-
Handling international characters.
Formatting Dates and Timestamps
In CDC connectors, you can easily format date, time, and timestamp fields extracted from the source database into customized string representations. Under the Formats and Templates settings, specify the desired format for each type of field—date, time, timestamp, and timestamp-with-timezone.
If you leave the format fields empty, values will flow downstream exactly as they are extracted from the database.
You can also explicitly set a timezone and enable conversion of timestamp-with-timezone fields into strings using your chosen format, providing consistent handling of timestamps across your pipelines.
Working with JSON Columns in the Source Database
When the source database contains nested data structures (e.g., documents in MongoDB) or columns with values serialized as JSON (such as JSON columns in MySQL or PostgreSQL), the CDC connector works differently when writing JSON and CSV files.
Behavior When Writing JSON Files
By default connector serializes CDC events as is. Additional options can be enabled or disabled:
-
Flatten CDC events: Flattens CDC events encoded in Extended JSON format. The connector decodes fields like $oid, $numberInt, $date, and transforms arrays to strings.
-
Convert JSON array to comma-separated list: Converts arrays such as [1,2,3] to 1,2,3.
-
Preserve square brackets: If enabled alongside the previous setting, keeps square brackets when converting arrays of objects:
[{"first": "joe"}, {"first": "jane"}] → [{"first": "joe"}, {"first": "jane"}].
Structure of the CDC Event
By default, the connector converts only the after state of the event into JSON. You can modify the structure of the streamed event by including additional parts of the original event:
Step 1. Enable Preserve structure of the CDC event when creating JSON file
Step 2. Configure the structure of the event.
Behavior When Writing CSV Files
If the output format is CSV, JSON columns automatically stringified into valid JSON strings. These strings preserve the original structure and can be loaded into target databases that support JSON columns (e.g., Snowflake or PostgreSQL) with no additional transformation required.
This approach ensures that complex nested data is not lost or flattened and remains usable in downstream systems.
Automatically Closing Files When Schema Changes
When a source table changes structure (e.g., columns added/removed), the header row in the CDC file may become outdated.
To prevent downstream load failures:
-
Enable Close CSV file if header has changed.
This ensures every new schema change starts a new file.
Capturing Source Metadata for Better Loads
If you want to auto-create destination tables with correct data types and primary keys:
-
CDC connection: enable Save metadata
-
CSV Format:
-
Enable Save metadata
-
Disable All fields are strings
-
This lets Etlworks:
-
Extract real types and constraints from the source.
-
Improve automatic mapping and validation.
-
Support features like Predict Lookup Fields.
If you want to treat everything as a string:
-
CDC connection: disable Save metadata
-
CSV Format:
-
Disable Save metadata
-
Enable All fields are strings
-
Respecting NOT NULL Constraints
By default, NOT NULL constraints are ignored during extraction.
To preserve them:
-
Enable Capture NOT NULL in the CDC connection.
Use this if:
-
You’re auto-creating tables and you want to enforce constraints downstream.
Controlling CDC Logging Frequency
By default, CDC connectors use exponential backoff when logging progress: : the time period between logging starts at every 5 seconds and increases exponentially until it reaches a maximum of 1 hour.
To control the maximum delay between log entries:
- In CDC connection → Other parameters:
debezium.max.log.poll.period=10
This example sets the maximum log delay to 10 minutes.
Use this if you need more frequent visibility for monitoring or troubleshooting.
Implementing Soft Deletes
By default, the Load Flow performs standard database operations based on the CDC event type:
-
c for create
-
u for update
-
d for delete
If you want to implement soft deletes instead of physically deleting records, you can configure the CDC Connection to include an extra column in the CDC stream that flags deletions. This is done using one of the special soft delete functions in the Extra columns setting.
Available Soft Delete Functions
-
cdc_boolean_soft_delete
Adds a user-defined boolean column to the CDC stream.
-
Value: true for delete events, null for all others.
-
-
cdc_boolean_soft_delete(true)
Also adds a boolean column, but returns:
-
true for delete events, false for all others.
-
-
cdc_timestamp_soft_delete(yyyy-MM-dd HH:mm:ss.SSS)
Adds a timestamp column.
-
Value: current timestamp (with milliseconds precision) for delete events, null for all others.
-
Soft Delete Examples
Adds column is_deleted with a value true for delete events, null for all others.
is_deleted=cdc_boolean_soft_delete
Adds column is_deleted with a value true for delete events, false for all others.
is_deleted=cdc_boolean_soft_delete(true)
Adds column deleted_at with a value equal to current timestamp (with milliseconds precision) for delete events, null for all others.
deleted_at=cdc_boolean_soft_delete(yyy-MM-dd HH:mm:ss.SSS)
Behavior Change with Soft Delete
When any of the soft delete functions are used, the CDC event type is automatically converted to u (update), regardless of the original type. This ensures the destination record is updated rather than deleted.
Interaction with destinations that perform deletes
Soft deletes rewrite the CDC event so the destination receives an update instead of a delete. That means destinations configured to delete on CDC delete events have nothing to act on when soft deletes are enabled. Two specific interactions to know about:
- MongoDB streaming destination with “Perform delete on matching MongoDB document for CDC 'delete' operation” — This destination option deletes a document when the CDC event type is d. Soft deletes change the event type to u before it reaches the destination, so the destination never sees a delete and no document is ever removed. The two configurations are mutually exclusive. Pick one: either soft-delete in the destination by writing the marker column, or hard-delete by leaving soft deletes off and enabling Perform delete. Etlworks' Flow Findings inspector flags the conflicting combination as a Critical structural issue.
- SQL MERGE flows that branch on cdc_op — Recipes that use the debezium_cdc_op column to decide whether to insert, update, or delete the destination row will see an update for what was originally a delete. The branch that handles deletes will not fire. The downstream behavior is whatever you configured the update branch to do, which may or may not be what you want for a soft-deleted row.
See also CDC settings reference → Extra columns (including soft deletes) for the full dependency map.
When NOT to use soft deletes
Soft deletes are intentional in some pipelines — audit warehouses, regulated industries that retain deleted records, anywhere the destination needs to keep historical context. They are a frequent source of confusion when they are enabled by accident:
- If your goal is for source deletes to delete the corresponding destination row, do not configure any of the cdc_*_soft_delete functions in Extra columns. The standard recipe with debezium_cdc_op and a MERGE-with-delete handles physical deletes natively.
- If you are debugging "my deletes don't propagate" on a CDC flow, check the source CDC connection's Extra columns field first. The presence of any cdc_*_soft_delete entry is the most common root cause.
Understanding the CDC Key Format
Each CDC record has a unique identifier (CDC Key ) that groups related events together. All events with the same CDC key are written to the same output file, topic, table, etc.
Default format:
-
[db]-[table]_cdc_stream
-
[db]_[schema]-[table]_cdc_stream (if schema is present)
Tokens:
-
[db]: database name
-
[schema]: schema name
-
[table]: table name
- [collection] - collection name
Examples:
-
For table orders in schema sales from DB erp →
erp_sales-orders_cdc_stream
-
For table customers in DB crm (no schema) →
crm-customers_cdc_stream
Adding and Transforming CDC Columns
The Extra Columns field on the CDC connection drives both addition and in-place transformation of CDC event columns. It can:
- Add new calculated columns — enrich every CDC row with metadata (operation type, key, timestamp, soft-delete flag, source attributes, hardcoded values) or build a derived key (e.g., a single Mongo _id from a composite source primary key).
- Mutate existing CDC event columns in-place — trim padded CHAR values, normalize case, run regex replacements, all without writing a JavaScript preprocessor.
Where to set it: CDC connection → Extra Columns.
Format
column_name[=function[(parameter)]], column_name[=function[(parameter)]]
Adding calculated columns
The classic use case — add metadata columns alongside every CDC row.
Available functions:
- cdc_op: the CDC event type — c for create, u for update, d for delete.
- cdc_key: the CDC key in the user-defined format.
- cdc_timestamp(yyyy-MM-dd HH:mm:ss.SSS): the timestamp of the CDC event. Not guaranteed to be unique across consecutive events.
- cdc_event_seq: the unique sequence number of the event. Grows monotonically.
- cdc_boolean_soft_delete: a boolean — true for delete, null for all other events.
- cdc_boolean_soft_delete(true): a boolean — true for delete, false for all other events.
- cdc_timestamp_soft_delete(yyyy-MM-dd HH:mm:ss.SSS): a timestamp — event timestamp for delete, null otherwise.
-
Data conversion functions — set the data type of the column to one of:
- integer, numeric, long, bit, double, float, varchar, boolean, timestamp, time, date. Example: created_at=timestamp(ts_ms).
- Any hardcoded value, including null. Example: created_at=null. Combine with a data-conversion function as needed.
-
Any attribute of the CDC event's source node. Each CDC connector exposes its own attribute set. Example: created_at=ts_ms. Attributes available for all connectors:
- ts_ms: source-database event timestamp in milliseconds.
- snapshot: is the connector currently performing the initial snapshot.
- name: name of the connector.
- version: version of the connector.
- connector: connector type.
- db: database name.
- table or collection: source table or collection.
- schema: schema name (when applicable).
Example
op=cdc_op, key=cdc_key, event_seq=cdc_event_seq, created_at=cdc_timestamp(yyyy-MM-dd HH:mm:ss.SSS), is_deleted=cdc_boolean_soft_delete(true), region=us-east
The last line shows a hardcoded value — region is added to every CDC row with the literal value us-east.
Building composite keys with composite_key()
Many CDC destinations — MongoDB especially — expect a single string identifier per record, but the source database's primary key may be composite (multiple columns) or named differently from table to table. composite_key() builds a deterministic key value from the source primary-key columns, with sensible defaults driven by the Debezium primary-key metadata.
The most common use is generating a MongoDB _id for a CDC-to-MongoDB flow:
mongoId=composite_key()
When the upstream CDC source exposes primary-key metadata (Debezium does this by default), composite_key() uses those columns automatically and joins them with a default delimiter of -.
Examples
Build the key from the source table's PK metadata (Debezium-provided).
mongoId=composite_key()
Same as above, explicitly setting the delimiter (default is already -).
mongoId=composite_key(delimiter='-')
Force a specific set of key columns across all tables.
mongoId=composite_key(fields='warehouse_id,order_no,line_no')
Per-table override — orders uses order_id; order_lines uses (order_id, line_no); other tables fall back to PK metadata.
mongoId=composite_key(overrides='orders:order_id|order_lines:order_id,line_no')
When PK metadata is unavailable, fall back to the standard cdc_key.
mongoId=composite_key(fallback='cdc_key')
Parameters
| Parameter | What it does |
|---|---|
| fields / keys / keyFields | Explicit comma-separated key columns to use for every table. Bypasses PK-metadata auto-detection. |
| overrides / override | Per-table key columns. Format: table1:col_a,col_b|table2:col_c. Tables not listed fall back to PK metadata. |
| delimiter | Separator between key parts. Default: -. |
| fallback | What to do when no key columns can be resolved: fail (error), cdc_key (use the standard CDC key), event_seq (use the event sequence number), empty (emit empty string), or a literal value. |
| missing | Behavior when one or more configured key columns are missing on a particular event: fail, empty, or skip (drop that column from the key). |
| caseSensitive | true / false. Whether column-name matching is case-sensitive. |
| empty | Value to use when the resulting key is empty. |
| null | Replacement value for null values in any key column. |
Transforming CDC string columns with cdc_string_transform()
cdc_string_transform() mutates existing CDC string columns in place. The primary use case is trimming fixed-width CHAR columns from SQL Server or MySQL before loading them into targets that don't tolerate trailing spaces (MongoDB, JSON destinations, downstream APIs).
The function is side-effect-only — it doesn't add a new column, it modifies the columns selected by its parameters. To call it, assign it to a throwaway column name that the engine ignores (anything prefixed with _ by convention):
_1=cdc_string_transform(types='char';ops='trim')
Note the ; separator between parameters (not ,) — , is reserved for separating top-level entries in Extra Columns.
Examples
Trim every CHAR column in every CDC event. Most common pattern for SQL Server / MySQL CHAR-padded fields.
_1=cdc_string_transform(types='char';ops='trim')
Trim and uppercase a single named column.
_1=cdc_string_transform(fields='status';ops='trim,upper')
Per-table column targeting — trim the listed columns on the listed tables only.
_1=cdc_string_transform(overrides='dbo.orders:status_code,ship_state|dbo.customers:customer_code';ops='trim')
Literal find / replace on a single column.
_1=cdc_string_transform(fields='status';ops='replace';find='PENDING';replace='OPEN')
Regex find / replace — collapse runs of whitespace to a single space.
_1=cdc_string_transform(fields='description';ops='replace';find='\s+';replace=' ';regex=true)
Selecting which columns to transform
| Selector | What it does |
|---|---|
| fields / columns | Comma-separated list of column names to transform. |
| types | Comma-separated list of source data types (char, varchar, text, …). Exact match. |
| typePattern / typeRegex | Regex match against the source data type. Caution: typePattern='char' matches both CHAR and VARCHAR because it's a regex search — use types='char' for exact-match CHAR-only selection. |
| tables | Comma-separated list of fully-qualified table names to apply the transform to. Tables not listed are unaffected. |
| overrides / override | Per-table column targeting. Format: table1:col_a,col_b|table2:col_c. |
| exclude | Comma-separated columns to skip even if they would otherwise match the selector. |
| caseSensitive | true / false. Whether column-name matching is case-sensitive. |
| missing | fail, empty, or skip — behavior when a configured column is missing from a particular event. |
Operations (parameter ops, comma-separated when chaining multiple)
| Operation | What it does |
|---|---|
| trim | Remove leading and trailing whitespace. |
| upper | Convert to uppercase. |
| lower | Convert to lowercase. |
| replace | Literal or regex find / replace. Requires find and replace parameters; set regex=true to treat find as a regex. |
Execution order — how mutators and composite_key() interact
Functions in Extra Columns run in the order they are listed. Each function sees the changes made by the functions before it. This matters when you combine cdc_string_transform() with composite_key():
Correct — trim CHAR columns first, then build the Mongo ID from the trimmed values:
_1=cdc_string_transform(types='char';ops='trim'), mongoId=composite_key()
The trim runs first, so the resulting mongoId is built from clean, untrimmed values.
Reversed — mongoId is built from the raw values:
mongoId=composite_key(), _1=cdc_string_transform(types='char';ops='trim')
mongoId is computed first, capturing the padded CHAR values. The trim then runs but it's too late to affect mongoId.
You can chain multiple transforms with distinct ignored names:
_1=cdc_string_transform(fields='status';ops='trim,upper'), _2=cdc_string_transform(types='char';ops='trim'), mongoId=composite_key()
When to use these helpers vs. a JavaScript preprocessor
These helpers replace the common JavaScript CDC preprocessors customers used to write to:
- Build MongoDB _id values from composite source primary keys.
- Trim padded CHAR fields before loading into a target that doesn't tolerate them.
- Normalize selected string columns (upper / lower / replace).
- Avoid maintaining a hand-written dictionary of table-to-key-column mappings when Debezium provides the PK metadata already.
The built-in functions are configuration, not code — easier to maintain, version-control-friendly, and reusable across flows. Existing JavaScript preprocessors continue to work, and the existing extra-column behavior (cdc_op, cdc_key, hardcoded values, data-conversion functions, source attributes) is unchanged. Use the JavaScript preprocessor when the transformation logic is genuinely complex; use composite_key() and cdc_string_transform() for the common patterns.
Modifying CDC Event and CDC Key
A CDC event represents a single row (record) in the CDC stream. A CDC key is a unique identifier that groups related events together. All events with the same CDC key are written to the same output file, topic, table, etc.
You can use a Preprocessor to modify either the event itself or the CDC key before it is processed or stored.
Preprocessor Basics
Supported scripting language: JavaScript
Available variables:
-
event: the CDC event as a JsonNode
-
key: the CDC key
-
source: the CDC source (also a JsonNode)
-
transactionId: the transaction ID (if transaction markers are enabled)
Accessing and Modifying Event Data
To get the value of a column (as a string):
var val = event.get('column_name').asText();
To set or replace a column value:
event.put('column_name', 'new_value');
Return Values
Your script must return one of the following:
-
true → process the event as is
-
false → skip the event entirely
-
com.toolsverse.util.TypedKeyValue<Boolean, String> → process the event, and override the CDC key
Example: Override the CDC Key
In this example:
-
The event will be processed (true)
-
The CDC key will be overridden with abc, causing the event to be grouped accordingly
var newCdcKey = 'abc';
value = new com.toolsverse.util.TypedKeyValue(true, newCdcKey);
Example: Conditionally Skip CDC Event
In this example:
- The event will not be processed if column has a value 'false'
var colValue = event.get('column_name').asText();
value = !colValue.equals('false');
Transaction Markers
In many real-world scenarios—such as an e-commerce system—individual business operations span multiple database tables. For example, a single purchase order may involve PURCHASE_ORDER, ORDER_LINE, and SHIPMENT_ADDRESS tables. Retrieving the full picture of a transaction requires joining these tables together.
However, with change data capture (CDC), events are streamed one row at a time and can arrive out of order. This makes it difficult to determine which events belong to the same logical transaction.
To solve this, you can enable transaction metadata in the CDC connection. When enabled, the connector emits special transaction marker events that indicate the start and end of each transaction.
How to Enable Transaction Markers
To enable transaction tracking:
-
Open the CDC connection and enable Provide Transaction Metadata.
-
(Optional) Enable Log Transaction Metadata to write additional metadata (such as involved tables and record counts) to the flow log.
-
(Optional) Enable Log Transaction Start/End to log start and end markers for visibility and debugging.
What Happens When Transaction Metadata Is Enabled
When Provide Transaction Metadata is enabled and the CDC flow is configured to write files:
-
The system creates a separate.end file for each committed transaction involving monitored tables.
-
This.end file is written to the same location as the CDC event files.
-
The file is generated after the transaction commits, using the naming format:
yyyyMMddHHmmssSSS_transactionId.end
This ensures that files sort chronologically by transaction time.
-
All CDC event files created during the transaction will include the same transaction ID as a suffix, allowing them to be easily grouped.
This mechanism helps downstream flows or consumers detect the end of a transaction, group events accordingly, and maintain transactional integrity during processing.