Overview
By default, Etlworks CDC connectors are pre-configured to support common use cases across supported databases. This guide outlines advanced or less common scenarios that may require additional configuration, grouped by database engine.
MySQL
Configuring MySQL for CDC
To enable Change Data Capture (CDC) certain configuration steps must be performed on the source MySQL server. This guide walks you through the required setup, including user permissions, binary logging, GTID configuration, and optional tuning for performance and observability.
1. Create a MySQL User with CDC Permissions
Etlworks requires a MySQL user account with the ability to access the binary log and perform snapshots. If using a managed service such as Amazon RDS or Aurora, additional permissions may be required.
Steps:
1. Connect to the MySQL server as a privileged user.
2. Create a new user account:
CREATE USER 'etlworks'@'%' IDENTIFIED BY 'your_secure_password';
3. Grant the necessary privileges:
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'etlworks'@'%';
4. If using Amazon RDS or Aurora (or another managed service that restricts global locking), also grant:
GRANT LOCK TABLES ON *.* TO 'etlworks'@'%';
5. Apply the changes:
FLUSH PRIVILEGES;
Permissions explained:
|
Permission |
Purpose |
|---|---|
|
SELECT |
Required to read data during the initial snapshot |
|
RELOAD |
Allows flushing tables when acquiring snapshot consistency |
|
SHOW DATABASES |
Allows listing all databases |
|
REPLICATION SLAVE |
Enables reading the binary log |
|
REPLICATION CLIENT |
Required for querying replication status and binary logs |
|
LOCK TABLES |
Used for snapshot locking in restricted environments like RDS |
NOTE: Some cloud providers restrict certain permissions. If LOCK TABLES is not allowed, Etlworks will attempt to use alternative snapshot modes automatically.
2. Enable Binary Logging
CDC depends on the binary log (binlog) to capture changes. Binary logging must be enabled with row-level logging.
For Self-Managed MySQL:
Steps:
1. Check if binary logging is enabled:
MySQL 5.x:
SELECT variable_value FROM information_schema.global_variables
WHERE variable_name = 'log_bin';
MySQL 8.x:
SELECT variable_value FROM performance_schema.global_variables WHERE variable_name = 'log_bin';
2. If not enabled, update your MySQL configuration with the following:
server-id = 223344
log_bin = mysql-bin
binlog_format = ROW
binlog_row_image = FULL
binlog_expire_logs_seconds = 864000
3. Restart the MySQL server after making changes.
4. Confirm that binary logging is now enabled by rerunning the previous query.
For AWS RDS / Aurora:
-
Binary logging is automatically enabled only if automated backups are enabled.
-
Set the binlog_format and binlog_row_image using a custom parameter group:
|
Parameter |
Value |
|---|---|
|
binlog_format |
ROW |
|
binlog_row_image |
FULL |
|
binlog_expire_logs_days |
10 |
Then apply the parameter group to the RDS instance and reboot.
For Azure Database for MySQL:
-
Azure MySQL does not support logical replication (no access to binlog).
-
CDC is not supported on Azure Database for MySQL.
-
Use snapshot-based extraction or migrate to a supported source.
For Google Cloud SQL for MySQL:
1. Enable binary logging in the Cloud Console or via gcloud:
gcloud sql instances patch INSTANCE_NAME \
--database-flags log_bin=on,binlog_format=ROW,binlog_row_image=FULL
2. Reboot the instance for changes to take effect.
Note: Ensure automated backups are enabled to retain binlogs.
3. (Optional) Enable Global Transaction Identifiers (GTIDs)
GTIDs can simplify replication and recovery, especially in failover scenarios. While not required, enabling GTIDs is recommended for clustered environments.
Steps:
1. In your MySQL configuration file, set:
gtid_mode = ON
enforce_gtid_consistency = ON
2. Restart MySQL and verify the settings:
SHOW GLOBAL VARIABLES LIKE '%GTID%';
Settings explained:
|
Setting |
Description |
|---|---|
|
gtid_mode |
Enables globally unique transaction identifiers |
|
enforce_gtid_consistency |
Ensures only GTID-safe statements are allowed |
4. (Recommended) Extend Session Timeouts
For large datasets, the snapshot phase may take longer than the default MySQL session timeout. To avoid unexpected disconnections, increase timeout values.
Steps:
1. Add the following to your MySQL configuration file:
wait_timeout = 7200
interactive_timeout = 7200
2. Restart the server to apply.
Settings explained:
|
Setting |
Description |
|---|---|
|
wait_timeout |
Time before a non-interactive session is closed due to inactivity |
|
interactive_timeout |
Time before an interactive session (e.g., from a client tool) is closed |
5. (Optional) Include Original SQL Statements in Binlog
To enable visibility into the original SQL statements that triggered change events, enable the appropriate logging flag.
Steps:
For MySQL:
binlog_rows_query_log_events = ON
For MariaDB:
binlog_annotate_row_events = ON
This setting allows the connector to include SQL text in the change stream, which is useful for auditing or debugging.
6. Validate and Clear binlog_row_value_options
MySQL 8+ includes the binlog_row_value_options setting, which must not be set to PARTIAL_JSON. This option can cause problems when parsing JSON updates.
Steps:
1. Check the current value:
SHOW GLOBAL VARIABLES WHERE variable_name = 'binlog_row_value_options';
2. If set to PARTIAL_JSON, clear it:
SET GLOBAL binlog_row_value_options = '';
Configuration Summary for MySQL
To enable CDC in Etlworks with MySQL, complete the following setup:
|
Area |
Task |
|---|---|
|
User permissions |
Create and grant privileges to a dedicated MySQL user |
|
Binary logging |
Enable binlog with ROW format and FULL row image |
|
GTID (optional) |
Enable gtid_mode and enforce_gtid_consistency |
|
Session timeout tuning |
Increase wait_timeout and interactive_timeout for snapshots |
|
SQL visibility (optional) |
Enable query logging to capture original SQL in the binlog |
|
JSON handling |
Ensure binlog_row_value_options is not set to PARTIAL_JSON |
Once your database is configured, you can proceed to create the CDC connection and flow in Etlworks.
Platform Support Matrix
|
Platform |
CDC Support |
Notes |
|---|---|---|
|
Self-managed MySQL |
✅ |
Full support |
|
Amazon RDS / Aurora |
✅ |
Ensure binary logging + backups are enabled |
|
Google Cloud SQL for MySQL |
✅ |
Enable binlog and required flags via gcloud or console |
|
Azure Database for MySQL |
❌ |
Does not support CDC via binary log; use snapshots or migrate source |
MySQL Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- ad-hoc initial: the connector runs a snapshot upon startup whenever it deems it necessary (when no offsets are available, or when a previously recorded offset specifies a binlog location or GTID that is not available in the server), then starts streaming changes from a binlog. This mode allows adding new tables and triggering incremental snapshots at runtime.
- ad-hoc schema only: use it if you don’t need a consistent snapshot of the data to be captured on startup but only need the changes since the connector was started. This mode allows adding new tables and triggering incremental snapshots at runtime.
- initial: the connector runs a snapshot only when no offsets have been recorded for the logical server name, then starts streaming changes from the binlog.
- initial_only: the connector runs a snapshot only when no offsets have been recorded for the logical server name and then stops, i.e. it will not read change events from the binlog.
- schema_only: use it if you don’t need a consistent snapshot of the data but only need the changes since the connector was started.
- when_needed: the connector runs a snapshot upon startup whenever it deems it necessary (when no offsets are available, or when a previously recorded offset specifies a binlog location or GTID that is not available in the server), then starts streaming changes from the binlog.
- schema_only_recovery: is a recovery mode for an existing connector to recover a corrupted or lost database history, or to periodically clean up a database history that may be growing unexpectedly.
- always_recover_schema: this mode is similar to when_needed but it always triggers the schema refresh on startup so it can be used to add new tables.
- schema_only_always_recover_schema: this mode is similar to schema_only but it always triggers the schema refresh on startup so it can be used to add new tables.
Adding New Tables to the CDC Pipeline from MySQL
To add new tables:
-
Ensure the snapshot mode is one of the following:
-
ad-hoc initial
-
ad-hoc schema only
-
-
Enable Automatically trigger ad-hoc snapshot for new tables.
-
Ensure database.history.store.only.captured.tables.ddl is enabled (default).
-
Stop the Flow.
-
Add fully qualified table names to the FROM field in the CDC flow. IMPORTANT: Regular expressions are not supported.
-
Restart the Flow.
Support for Compressed Binlog
The MySQL CDC connector supports reading from compressed binary logs with no additional configuration required.
Handling Spatial (Geometry) Data in MySQL
By default, geometry fields are encoded as WKB and then Base64-encoded.
To convert to human-readable JSON, enable: Convert Geometry WKB to JSON.
Starting from Specific Binlog File/Position
The MySQL CDC connector can be configured to start streaming change events from a specific point in the source database’s binary log (binlog). This is useful when you need to resume replication from a known offset or to recover from a failure without re-running a snapshot.
In the CDC Connection settings under Starting Position After Snapshot, you can specify:
-
Starting binlog filename: Enter the name of the binlog file (for example, mysql-bin.000123).
-
Starting binlog position: Enter the position within the specified binlog file (for example, 154).
-
Starting GTID: Optionally, enter a Global Transaction Identifier (GTID) to start from a specific transaction instead of a file/position.
This is supported only when the connector transitions from snapshot to streaming and the snapshot mode is one of:
-
ad-hoc initial
-
ad-hoc schema only
-
initial
-
schema_only
-
always_recover_schema
-
schema_only_always_recover_schema
Use Cursor-Based Snapshot (Avoid Streaming Errors)
In rare cases, large snapshots may cause: “No statements may be issued when any streaming result sets are open…”
To avoid this, add the following property to Other Parameters:
database.useCursorFetch=true
This uses a temp table for snapshotting, which increases MySQL resource usage but avoids the error.
PostgreSQL
Configuring PostgesSQL for CDC
Before using PostgreSQL as a source for Change Data Capture (CDC) in Etlworks, the database must be configured to support logical replication. This includes enabling logical replication in the server configuration, ensuring replication slots are available, and assigning correct privileges to the CDC user.
The PostgreSQL CDC connector in Etlworks uses the built-in pgoutput plugin to stream changes from the database. This is supported in PostgreSQL 10 and above.
PostgreSQL on AWS RDS
1. Enable logical replication
Set the rds.logical_replication parameter to 1 in your DB instance parameter group.
This automatically enables wal_level = logical.
2. Restart the RDS instance
After updating the parameter group, restart the database to apply changes.
3. Verify WAL settings
Confirm that wal_level is set to logical:
SHOW wal_level;
4. User permissions
The user connecting from Etlworks must have the rds_replication role:
GRANT rds_replication TO etlworks_user;
To verify:
SELECT pg_has_role('etlworks_user', 'rds_replication', 'member');
Should return true.
5. Required privileges
-
LOGIN privilege (required to connect)
-
CONNECT privilege on the database (usually granted by default)
-
REPLICATION privilege (granted via rds_replication)
-
SELECT privilege on all tables being capturedPostgreSQL on Azure Database for PostgreSQL
1. Enable logical replication
Set the azure.replication_support parameter to logical. Use Azure CLI:
az postgres server configuration set \
--resource-group <resource_group> \
--server-name <server_name> \
--name azure.replication_support \
--value logical
2. Restart the instance
az postgres server restart \
--resource-group <resource_group> \
--name <server_name>
3. Ensure access from Etlworks IPs
Configure firewall rules in the Azure portal to allow connections from the Etlworks IP ranges.
PostgreSQL on CrunchyBridge
-
Logical replication is enabled by default.
-
pgoutput plugin is pre-installed.
-
You must create a replication user and grant necessary privileges:
CREATE ROLE etlworks_user REPLICATION LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO etlworks_user;
PostgreSQL Server Configuration
The following settings must be applied in postgresql.conf or your cloud provider’s parameter group:
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
Additional optional settings for better performance and lower latency:
wal_writer_delay = 10ms
Host Access Configuration for self-managed PostgreSQL (pg_hba.conf)
To allow the Etlworks agent to connect and replicate changes:
# Example entries:
host replication etlworks_user 0.0.0.0/0 md5
host all etlworks_user 0.0.0.0/0 md5
These entries must match the IP address of the Etlworks Integration Agent or the Etlworks cloud service.
Setting Up Permissions
Etlworks requires a PostgreSQL user with minimal necessary privileges:
-- Create replication role
CREATE ROLE etlworks_user REPLICATION LOGIN PASSWORD 'your_password';
-- Grant read access
GRANT SELECT ON ALL TABLES IN SCHEMA public TO etlworks_user;
-- (Optional) Allow auto-creation of publications
GRANT CREATE ON DATABASE your_database TO etlworks_user;
If you want Etlworks to automatically create publications for selected tables, assign the CREATE privilege on the database.
Additional Requirements for Filtered Publications
If you’re using filtered publications, PostgreSQL enforces stricter rules:
The CDC user must be the owner of each table included in the filtered publication.
Example error if this is not the case:
ERROR: must be owner of table table_name
To change ownership of a table:
ALTER TABLE public.table_name OWNER TO etlworks_user;
Publications
Etlworks CDC uses PostgreSQL publications to stream changes from one or more tables. A publication is a logical grouping of tables whose data changes (INSERT, UPDATE, DELETE) are captured by the database and sent to the CDC consumer (Etlworks). Every PostgreSQL CDC connection in Etlworks must specify a Publication name.
Use property Publication Autocreate mode to control how (or if) the publication is created.
There are three modes for managing publications in Etlworks:
1. filtered (auto-create selected tables), recommended for most cases
When this mode is selected:
-
Etlworks will automatically create a publication with the name specified in the Publication Name field.
-
The publication will include only the tables selected in the CDC flow.
-
If the publication already exists, it will be reused and updated as needed (e.g., if new tables are added).
-
A PostgreSQL user must have both CREATE and SELECT privileges to allow Etlworks to manage the publication.
This mode is ideal for typical CDC pipelines and supports granular control of which tables are monitored.
2. all_tables (auto-create for all tables in the database)
When this mode is selected:
-
Etlworks will create a publication that includes all tables in the current database schema.
-
This is useful in environments where you want to replicate every table without manually managing the list.
-
The user must still have CREATE and SELECT privileges.
-
This mode is only supported if the PostgreSQL server allows listing and accessing all user tables.
Note: If any tables are excluded by filters in the flow or by column-level permissions, they will not emit changes even if present in the publication.
3. disabled (manual publication only), recommended for production
When this mode is selected:
-
Etlworks will not create or manage the publication.
-
You must manually create and maintain the publication on the database.
Example SQL to create a manual publication:
CREATE PUBLICATION etlworks_pub FOR TABLE your_table1, your_table2;
Add new tables and removed tables explicitly:
ALTER PUBLICATION etlworks_pub ADD TABLE order;
ALTER PUBLICATION etlworks_pub REMOVE TABLE order;
This mode is recommended for production systems where publication management is restricted or audited, or when replication scope is tightly controlled.
Summary of Required Connection Fields
|
Field |
Description |
|---|---|
|
Output Plugin |
Must be pgoutput. This is the default plugin in PostgreSQL 10+ |
|
Replication Slot Name |
Unique slot name used to track CDC changes (e.g., etlworks_slot_1) |
|
Publication Name |
Name of the publication used or created by Etlworks (e.g., etlworks_pub) |
|
Publication Autocreate Mode |
One of: filtered, all_tables, disabled |
Managing WAL Retention and Disk Space
PostgreSQL uses Write-Ahead Logging (WAL) to record changes before they are applied to tables. Etlworks CDC relies on these WAL records to capture change events in near real-time. However, in certain environments, WAL files can accumulate and consume excessive disk space if not managed correctly.
The most common causes and solutions are described below.
1. Understanding LSNs and WAL Cleanup
PostgreSQL maintains metadata for each replication slot in the pg_replication_slots view. Two key columns to monitor are:
-
confirmed_flush_lsn: the latest WAL position acknowledged by the Etlworks CDC connector.
-
restart_lsn: the oldest WAL segment still needed by the connector.
PostgreSQL only reclaims WAL files that are no longer needed. If confirmed_flush_lsn advances regularly but restart_lsn remains stagnant, space will not be freed. However, PostgreSQL typically reclaims disk space in batch blocks, and this behavior is normal — no immediate action is required unless WAL usage grows unexpectedly.
2. Idle Source Tables and WAL Growth
If many changes occur in the database overall, but the specific tables monitored by Etlworks CDC remain idle, the connector may not confirm new LSNs. PostgreSQL continues to retain WAL segments because it cannot confirm they are no longer needed — even though most of the data has nothing to do with the CDC tables.
To address this:
-
Enable heartbeat messages using the heartbeat.interval.ms configuration property in Other Parameters.
-
Include a heartbeat table in the publication, so the connector can track its changes.
-
Emit changes proactively using heartbeat.action.query property in Other Parameters: INSERT INTO public.heartbeat (text) VALUES ('heartbeat')
When configured correctly, heartbeat events allow the connector to process WAL records regularly — even when the CDC tables are idle — which in turn enables PostgreSQL to safely clean up old WAL files.
3. Handling Multi-Database Hosts (e.g., RDS)
When multiple databases share the same PostgreSQL host (as is common in Amazon RDS, Azure Database for PostgreSQL, and Google Cloud SQL), WAL is shared across all databases, but replication slots are per-database. This means:
-
A high-traffic database can generate WAL segments continuously.
-
A low-traffic database (the one being tracked by Etlworks CDC) may not emit enough events to update its replication slot.
-
PostgreSQL cannot safely remove WAL segments, since the low-traffic database’s replication slot hasn’t acknowledged them.
This results in unbounded WAL growth, even if the monitored database appears idle.
To prevent this:
-
Enable heartbeat messages using the heartbeat.interval.ms configuration property in Other Parameters.
-
Include a heartbeat table in the publication, so the connector can track its changes.
-
Emit changes proactively using heartbeat.action.query property in Other Parameters: INSERT INTO public.heartbeat (text) VALUES ('heartbeat')
Heartbeat Configuration for Idle Databases
Step 1. Make sure only monitored tables are included in publication
Step 2. Create a heartbeat table:
Modify schema and table name if needed. Make sure schema is included in the list of monitored schemas in the Postgres CDC connection.
create table public.cdc_heartbeat (id SERIAL PRIMARY KEY, ts timestamptz)
Step 3. Add the heartbeat table to the publication if needed
Modify publication name and table name if needed.
ALTER PUBLICATION etlworks_pub ADD TABLE cdc_heartbeat;
Step 4. Add cdc_heartbeat table to the list of monitored tables (FROM in CDC flow)
Step 5. In the Etlworks CDC connection under Other Parameters set the following parameters:
heartbeat.interval.ms = 100000
and
heartbeat.action.query = INSERT INTO public.cdc_heartbeat (id, ts) VALUES (1, NOW()) ON CONFLICT(id) DO UPDATE SET ts=NOW()
This ensures that even if your tracked tables are idle, the connector continues to process WAL segments and prevent unnecessary disk growth.
Clustered PostgreSQL Deployments
PostgreSQL supports logical replication only on the primary node. Etlworks must connect to the primary instance.
-
If the primary fails and a new node is promoted, you must reconfigure the CDC connection in Etlworks to point to the new primary.
-
Logical replication is not supported on read replicas.
PostgeSQL Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- ad-hoc initial: the connector performs an initial snapshot only when no offsets have been recorded for the logical server name. This mode allows adding new tables and triggering incremental snapshots at runtime.
- ad-hoc: the connect never runs a snapshot and upon the first startup with a logical server name the connector reads data either from where it last left off (last LSN position) or starts from the beginning from the point of the view of the logical replication slot. This mode allows adding new tables and triggering incremental snapshots at runtime.
- initial: the connector performs a snapshot only when no offsets have been recorded for the logical server name.
- initial_only: the connector only takes an initial snapshot and then stops, without processing any subsequent changes.
- never: the connect never runs a snapshot and upon the first startup with a logical server name the connector reads data either from where it last left off (last LSN position) or starts from the beginning from the point of the view of the logical replication slot.
- always: the connector runs a snapshot each time on startup.
- exported: deprecated and equals to initial.
Adding New Tables to the CDC Pipeline from PostgreSQL
To add new tables:
Step 1. Make sure Automatically trigger ad-hoc snapshot for new tables is enabled in CDC connection.
Step 2. Stop the Flow.
Step 3. If PostgreSQL CDC connection configured with Publication Autocreate mode set to disabled you will need to update the publication manually using SQL.
Add new tables explicitly:
ALTER PUBLICATION etlworks_pub ADD TABLE order;
Remove tables:
ALTER PUBLICATION etlworks_pub REMOVE TABLE order;
Step 4. Add fully qualified table names to the FROM field in the CDC flow. IMPORTANT: Regular expressions are not supported.
Step 5. Restart the Flow.
SQL Server
Configuring SQL Server for CDC
Etlworks can continuously capture INSERT, UPDATE, and DELETE operations from SQL Server using the database’s native Change Data Capture (CDC) feature.
To use CDC with SQL Server, a database administrator must first enable CDC at the database level, then on each table that should be monitored. This must be done manually before configuring a CDC flow in Etlworks.
Prerequisites
Before configuring CDC for SQL Server, ensure the following:
-
You are a member of the sysadmin server role.
-
You are a member of the db_owner role on the database.
-
The SQL Server Agent is running.
-
The database is not the master database (CDC cannot be enabled on master).
-
The database contains user-created tables (CDC cannot track system tables).
Step 1: Enable CDC at the Database Level
To enable CDC on the database:
USE [YourDatabase]
GO
EXEC sys.sp_cdc_enable_db
GO
Once enabled, SQL Server automatically creates a schema named cdc, along with supporting metadata tables, system functions, and stored procedures.
Step 2: Enable CDC on Specific Tables
USE [YourDatabase]
GO
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'MyTable',
@role_name = N'MyRole',
@filegroup_name = N'MyFilegroup',
@supports_net_changes = 0
GO
Notes:
-
Replace MyTable with the actual table name.
-
Replace MyFilegroup with the name of a filegroup (must already exist).
-
@role_name defines which users can query the change table. Use NULL to restrict access to sysadmin and db_owner.
Step 3: Verify CDC Is Enabled
To verify that CDC is enabled on a database or table:
USE [YourDatabase]
GO
EXEC sys.sp_cdc_help_change_data_capture
GO
This returns configuration details for all CDC-enabled tables. If the result is empty, the user may not have sufficient permissions.
Using Etlworks with SQL Server CDC
After CDC is enabled at the database and table level, you can configure a CDC Connection and Flow in Etlworks.
-
Etlworks will automatically detect the list of CDC-enabled tables.
-
Flows will stream committed changes as soon as they are made available by SQL Server.
Note: The SQL Server Agent must remain active. If the Agent is stopped, change events will not be captured.
Performance Considerations
SQL Server uses a capture job agent to scan the transaction log and write changes to CDC tables. This job runs periodically and introduces some latency between source changes and CDC table updates.
If you’re observing high CPU load or latency, tune the capture agent parameters.
Capture Job Agent Parameters
The capture agent configuration is stored in msdb.dbo.cdc_jobs. Use the procedure below to change settings:
EXEC sys.sp_cdc_change_job
@job_type = N'capture',
@pollinginterval = 10, -- in seconds
@maxtrans = 200,
@maxscans = 5
GO
Parameter descriptions:
-
@pollinginterval: Wait time between log scans. Higher values reduce CPU usage but increase latency.
-
@maxtrans: Maximum number of transactions to process per scan.
-
@maxscans: Maximum number of scan cycles before sleeping.
Tip: The default polling interval is 5 seconds. Increase to 10 or 15 seconds if your server is under load.
SQL Server Always On Support
Etlworks supports capturing changes from an Always On read-only replica. You must:
-
Enable CDC on the primary node (not the replica).
-
Set applicationIntent=ReadOnly in the JDBC connection string.
-
Ensure snapshot.isolation.mode is set to snapshot (Etlworks sets this automatically for read-only replicas).
-
The replica must periodically reconnect to maintain the latest view of the CDC tables.
Connecting Without SSL
By default, JDBC connections to SQL Server are encrypted using SSL. If you are connecting to a database without SSL, or prefer to disable encryption:
-
Set database.encrypt=false in the connection properties under Other Parameters.
SQL Server Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- ad-hoc initial: the connector takes a snapshot of the structure and data of captured tables, then starts streaming change events from the captured tables. This mode allows adding new tables and triggering incremental snapshots at runtime.
- ad-hoc schema only: the connector takes a snapshot of the structure of captured tables only, then starts streaming change events from the captured tables. This mode allows adding new tables and triggering incremental snapshots at runtime.
- initial: the connector takes a snapshot of the structure and data of captured tables, then starts streaming change events from the captured tables. It is useful if a complete representation of the data from the included tables should be captured.
- initial_schema_only: the connector takes a snapshot of the structure of captured tables only. It is useful if only changes happening from now onwards should be propagated downstream. Once the snapshot is complete, the connector will continue reading change events from the database’s redo logs.
- initial only: the connector performs a database snapshot and stops before streaming any change event records.
Adding New Tables to the CDC Pipeline from SQL Server
To add new tables:
-
Ensure the snapshot mode is one of the following:
-
ad-hoc initial
-
ad-hoc schema only
-
-
Enable Automatically trigger ad-hoc snapshot for new tables.
-
Stop the Flow.
-
Add fully qualified table names to the FROM field in the CDC flow. IMPORTANT: Regular expressions are not supported.
-
Restart the Flow.
Avoid Locking During Snapshot
Set Snapshot Isolation Mode to read_uncommitted to prevent table locks.
Handling Binary and Spatial Data
For compatibility and readability add the following property to Other Parameters:
binary.handling.mode=hex
This serializes binary and geometry columns as hex-encoded strings.
Query All Tables for Changes — When to Leave it Off
The SQL Server CDC connection has an option labeled Query all tables for changes. By default it is disabled, and the connector consults SQL Server's change tracking metadata to decide which captured tables actually have changes since the last polling cycle. Only those tables are queried.
When the option is enabled, the connector queries every captured table on every polling cycle regardless of whether SQL Server has reported any changes for it. The intent is to recover from rare cases where SQL Server's change tracking fails to surface a change.
When to leave it off. Leave it disabled in almost all cases. On a typical workload most captured tables receive no updates between cycles, and querying them anyway multiplies the per-cycle load on SQL Server by the number of captured tables for no benefit. The Etlworks Flow Findings inspector flags this option being enabled as a Minor performance issue.
When to enable it. Only enable it if you have confirmed, in your specific environment, that SQL Server is reporting changes incorrectly and the connector is missing events as a result. This situation is rare. See also CDC settings reference → Query all tables for changes.
Oracle
Configuring Oracle for CDC
This article explains how to configure Oracle to work with Etlworks Change Data Capture (CDC).
The instructions apply to both on-premise Oracle databases and Oracle on AWS RDS, and assume you’re using a multi-tenant setup (container + pluggable databases).
Adjustments may be necessary for single-tenant installations.
Etlworks CDC for Oracle uses LogMiner to continuously capture changes from Oracle redo logs. Before you create a CDC Flow in Etlworks, you must:
-
Enable archivelog mode
-
Configure LogMiner
-
Enable supplemental logging
-
Create and configure a CDC user with appropriate permissions
1. Enable ArchiveLog and Configure LogMiner
On-Premise Oracle
If you’re running Oracle outside of AWS (with access to SYSDBA), run the following commands:
-- Log in as SYSDBA
sqlplus / as sysdba
-- Set up archive logging
ALTER SYSTEM SET db_recovery_file_dest_size = 10G;
ALTER SYSTEM SET db_recovery_file_dest = '/opt/oracle/oradata/recovery_area' SCOPE=spfile;
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
-- Confirm archive mode is active
ARCHIVE LOG LIST;
EXIT;
Oracle on AWS RDS
If you’re using Oracle RDS, you don’t have SYSDBA access, so use the RDS-provided stored procedures:
-- Confirm archivelog is enabled
SELECT LOG_MODE FROM V$DATABASE;
-- Should return: ARCHIVELOG
-- Set retention period
EXEC rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours', 24);
-- Enable minimal supplemental logging
EXEC rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD');
2. Enable Supplemental Logging
Etlworks CDC requires supplemental logging to capture before values and primary keys.
At the database level:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
At the table level:
ALTER TABLE schema_name.table_name ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
3. Create a CDC User
Etlworks uses a dedicated Oracle user to read redo logs and monitor changes. Create this user with sufficient privileges.
Example (on-prem with container DB):
-- Log in to the root container
sqlplus sys/password@//host:port/ORCLCDB as sysdba
-- Create tablespace
CREATE TABLESPACE logminer_tbs DATAFILE '/opt/oracle/oradata/ORCLCDB/logminer_tbs.dbf'
SIZE 25M REUSE AUTOEXTEND ON MAXSIZE UNLIMITED;
-- Create common user
CREATE USER c##etlworks IDENTIFIED BY your_password
DEFAULT TABLESPACE logminer_tbs
QUOTA UNLIMITED ON logminer_tbs
CONTAINER=ALL;
-- Grant required privileges
GRANT CREATE SESSION TO c##etlworks CONTAINER=ALL;
GRANT SET CONTAINER TO c##etlworks CONTAINER=ALL;
GRANT FLASHBACK ANY TABLE TO c##etlworks CONTAINER=ALL;
GRANT SELECT ANY TABLE TO c##etlworks CONTAINER=ALL;
GRANT SELECT_CATALOG_ROLE TO c##etlworks CONTAINER=ALL;
GRANT EXECUTE_CATALOG_ROLE TO c##etlworks CONTAINER=ALL;
GRANT SELECT ANY TRANSACTION TO c##etlworks CONTAINER=ALL;
GRANT CREATE TABLE TO c##etlworks CONTAINER=ALL;
GRANT CREATE SEQUENCE TO c##etlworks CONTAINER=ALL;
GRANT LOCK ANY TABLE TO c##etlworks CONTAINER=ALL;
-- Access to LogMiner
GRANT EXECUTE ON DBMS_LOGMNR TO c##etlworks CONTAINER=ALL;
GRANT EXECUTE ON DBMS_LOGMNR_D TO c##etlworks CONTAINER=ALL;
-- Access to internal views
GRANT SELECT ON V_$DATABASE TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$LOG TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$LOGFILE TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$ARCHIVED_LOG TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$LOG_HISTORY TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$LOGMNR_LOGS TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$LOGMNR_CONTENTS TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$LOGMNR_PARAMETERS TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$TRANSACTION TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$ARCHIVE_DEST_STATUS TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$MYSTAT TO c##etlworks CONTAINER=ALL;
GRANT SELECT ON V_$STATNAME TO c##etlworks CONTAINER=ALL;
Replace c##etlworks and your_password with your actual username and password.
4. Excluded Tables and Schemas
Etlworks automatically excludes system schemas and internal Oracle tables from CDC.
To capture changes, ensure your tables are in user-defined schemas and not in Oracle internal schemas.
Excluded table patterns:
-
SYS_IOT_OVER_% (index-organized tables)
-
CMP3$%, CMP4$% (Compression Advisor)
-
MDRT_%, MDRS_% (spatial indexes)
-
Nested tables
To capture such tables, rename them if needed.
5. Redo Log and Archive Log Configuration
Ensure your Oracle database has:
-
Adequate redo log size and count to avoid performance bottlenecks
-
Sufficient archive log retention (e.g., 24 hours)
-
At least one local VALID archive destination available
To list archive destinations:
SELECT DEST_ID, STATUS, DESTINATION, TARGET, VALID_NOW
FROM V$ARCHIVE_DEST_STATUS;
6. CDC in Standby Environments
Etlworks requires direct access to the redo logs. Oracle physical standby databases do not allow external connections, so CDC must run against the primary or a logical standby (with writable access).
If failover occurs:
-
Promote the standby to primary
-
Update the connection settings in Etlworks
-
Resume CDC from the new primary
Configuring CDC from Oracle RAC (Real Application Clusters)
Set the rac.nodes property in Other Parameters in CDC connection to enable compatibility with Oracle RAC deployments. This property accepts a comma-separated list of RAC node hostnames or IP addresses.
There are two supported configurations:
1. Using a shared port for all nodes
Set database.port to the shared listener port and list the hostnames or IPs in rac.nodes:
database.port=1521
rac.nodes=192.168.1.100,192.168.1.101
2. Using custom ports per node
Override the port per node directly in the rac.nodes list:
database.port=1521
rac.nodes=192.168.1.100,192.168.1.101:1522
Note: If you’re using a raw JDBC URL (database.url) instead of database.port, you must specify the port explicitly for every entry in rac.nodes.
Connecting to Read-Only Databases
To stream from a read-only Oracle instance:
-
Set CDC Adapter to logminer.
-
Enable Run connector in read-only mode.
This avoids flushing the LGWR buffer, which is not possible in read-only mode.
Use JDBC URL for Reliable Connectivity
Due to Oracle’s flexible JDBC URL formats, always provide a fully qualified JDBC URL, along with Host, Port, Database, and optionally the PDB name.
Oracle Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- ad-hoc initial: the connector takes a snapshot of the structure and data of captured tables, then starts reading change events from the database’s redo logs. This mode allows adding new tables and triggering incremental snapshots at runtime.
- ad-hoc schema only: the connector takes a snapshot of the structure of captured tables only, then starts reading change events from the database’s redo logs. This mode allows adding new tables and triggering incremental snapshots at runtime.
- initial: the connector takes a snapshot of the structure and data of captured tables, then starts reading change events from the database’s redo logs.
- schema_only: the connector takes a snapshot of the structure of captured tables only. It is useful if only changes happening from now onwards should be propagated downstream. Once the snapshot is complete, the connector will continue reading change events from the database’s redo logs.
- schema_only_recovery: set this option to restore a database history that is lost or corrupted. After a restart, the connector runs a snapshot that rebuilds the database history.
- always_recover_schema: This mode is similar to initial but it always triggers the schema refresh on startup so it can be used to add new tables.
- schema_only_always_recover_schema: This mode is similar to schema_only but it always triggers the schema refresh on startup so it can be used to add new tables
Improve Snapshot Performance
Flashback queries (AS OF SCN) can be slow. To skip them:
-
Enable Disable Flashback Query (AS OF) when configuring the snapshot.
Adding New Tables to the CDC Pipeline from Oracle
To add new tables:
-
Ensure the snapshot mode is one of the following:
-
ad-hoc initial
-
ad-hoc schema only
-
-
Enable Automatically trigger ad-hoc snapshot for new tables.
-
Ensure database.history.store.only.captured.tables.ddl is enabled (default).
-
Stop the Flow.
-
Add fully qualified table names to the FROM field in the CDC flow. IMPORTANT: Regular expressions are not supported.
-
Restart the Flow.
Avoid OutOfMemory Errors in Long Transactions
Oracle’s redo logs intermingle events from all concurrent transactions. To manage large transactions without exhausting heap memory:
-
Enable Embedded Infinispan Cache for Log Mining Buffer.
This stores events in an embedded, disk-based cache and reduces memory footprint.
IBM DB2
Configuring DB2 for CDC
To enable CDC on IBM Db2 databases, you must configure the database to capture changes using IBM’s ASN capture agent. Once configured, Etlworks will be able to stream inserts, updates, and deletes from the database.
Prerequisites
-
You must be logged in as a privileged Db2 user (e.g., db2inst1).
-
A recent full database backup is required before enabling CDC.
-
Ensure the bldrtn utility is available in your PATH.
Step 1: Prepare the Capture Environment
1. Bind JDBC metadata for catalog access:
cd $HOME/sqllib/bnd
db2 connect to <DB_NAME>
db2 bind db2schema.bnd blocking all grant public sqlerror continue
2. Backup the database (required by ASN agents):
db2 backup db <DB_NAME> to /your/backup/path
3. Restart the database after backup:
db2 restart db <DB_NAME>
Step 2: Install Control Functions (UDFs)
These are one-time setup steps.
1. Compile and copy control functions:
cd $HOME/asncdctools/src
bldrtn asncdc
cp asncdc $HOME/sqllib/function
chmod 777 $HOME/sqllib/function
2. Install UDFs and control tables:
db2 connect to <DB_NAME>
db2 -tvmf asncdc_UDF.sql
db2 -tvmf asncdctables.sql
db2 -tvmf asncdcaddremove.sql
Step 3: Start the Capture Agent
You can start the ASN capture agent using either SQL or the shell:
Option A: From SQL:
VALUES ASNCDC.ASNCDCSERVICES('start','asncdc');
Option B: From shell:
db2 "VALUES ASNCDC.ASNCDCSERVICES('start','asncdc');"
If the result includes a command like:
start --> /path/to/asncap capture_schema=asncdc capture_server=SAMPLE &
Execute it to complete startup.
Step 4: Enable CDC for Tables
Enable CDC for each individual table you want to capture:
CALL ASNCDC.ADDTABLE('MYSCHEMA', 'MYTABLE');
To apply changes, reinitialize the service:
VALUES ASNCDC.ASNCDCSERVICES('reinit','asncdc');
Tuning Performance and WAL Retention
Db2’s CDC mechanism uses transaction logs and change tables. Over time, the volume of logs and CPU load can increase. To manage this:
Use Heartbeat Table for Low-Traffic Databases
If the source database has low activity, WAL segments may not get acknowledged, causing excessive disk usage.
To solve this, create a heartbeat table and use the following configuration in Etlworks CDC connection:
INSERT INTO heartbeat_table (note) VALUES ('heartbeat')
This query will be executed periodically to trigger change events, allowing Etlworks to advance the log position and free disk space.
Adjust Capture Agent Settings
Use the IBMSNAP_CAPPARMS table to configure agent behavior. The two most impactful parameters are:
-
COMMIT_INTERVAL:
Time in seconds before committing data to change tables.
Higher values reduce CPU usage but increase latency.
Default: 30
-
SLEEP_INTERVAL:
Time in seconds the agent sleeps before starting a new commit cycle.
Higher values reduce load but delay capture.
Default: 5
Example SQL to update settings:
UPDATE IBMSNAP_CAPPARMS
SET COMMIT_INTERVAL = 60, SLEEP_INTERVAL = 10
WHERE SERVER = 'asncdc';
DB2 Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- ad-hoc initial: for tables in capture mode, the connector takes a snapshot of the schema for the table and the data in the table. After the snapshot is completed, the connector continues by reading change events from the database’s redo logs. This mode allows adding new tables and triggering incremental snapshots at runtime.
- ad-hoc schema only: for tables in capture mode, the connector takes a snapshot of only the schema for the table. After the schema snapshot is completed, the connector continues by reading change events from the database’s redo logs. This mode allows adding new tables and triggering incremental snapshots at runtime.
- initial: for tables in capture mode, the connector takes a snapshot of the schema for the table and the data in the table.
- schema_only: for tables in capture mode, the connector takes a snapshot of only the schema for the table. This is useful when only the changes that are happening from now on need to be emitted. After the snapshot is completed, the connector continues by reading change events from the database’s redo logs.
Adding New Tables to the CDC Pipeline from DB2
To add new tables:
-
Enable Automatically trigger ad-hoc snapshot for new tables.
-
Ensure database.history.store.only.captured.tables.ddl is enabled (default).
-
Stop the Flow.
-
Add fully qualified table names to the FROM field in the CDC flow. IMPORTANT: Regular expressions are not supported.
-
Restart the Flow.
AS/400
Configuring AS/400 for CDC
Etlworks can capture changes from IBM i (AS/400) databases using database journals. To enable this functionality, follow the steps below to configure the source system, user privileges, and CDC connection.
Prerequisites
-
IBM i must be configured to use journals on the tables you want to monitor.
-
You must define the necessary user privileges, environment variables, and memory options.
Step 1: Grant Required Privileges to the CDC User
Create or identify a user that Etlworks will use to connect to the IBM i system. This user must be granted read and execute access to the appropriate libraries and journal objects.
Run the following commands on IBM i:
-- Access to the journal library
GRTOBJAUT OBJ(<JRNLIB>) OBJTYPE(*LIB) USER(<CDC_USER>) AUT(*EXECUTE)
-- Access to all journal receivers in the journal library
GRTOBJAUT OBJ(<JRNLIB>/*ALL) OBJTYPE(*JRNRCV) USER(<CDC_USER>) AUT(*USE)
-- Access to the journal object
GRTOBJAUT OBJ(<JRNLIB>/<JRN>) OBJTYPE(*JRN) USER(<CDC_USER>) AUT(*USE *OBJEXIST)
-- Access to the library where application tables reside
GRTOBJAUT OBJ(<PROJECT_LIB>) OBJTYPE(*LIB) USER(<CDC_USER>) AUT(*EXECUTE)
-- Access to all files (tables) in the project library
GRTOBJAUT OBJ(<PROJECT_LIB>/*ALL) OBJTYPE(*FILE) USER(<CDC_USER>) AUT(*USE)
Replace:
-
<JRNLIB>: Library where the journal and receivers reside
-
<JRN>: Journal name
-
<PROJECT_LIB>: Library with the application tables
-
<CDC_USER>: User ID used by Etlworks to connect
Step 2: Configure Partitions for Production
For production environments, it is recommended to increase partitioning and replication for resiliency and scalability:
PARTITIONS=3
REPLICATION_FACTOR=3
Limitations
-
No support for remote journals or automatic failover
-
No support for CLOBs, XML, or large BLOBs
-
Limited support for runtime table structure changes
-
Does not automatically recover if the journal is deleted before being read — a manual resync is required
Best Practices
-
Avoid deleting journals before Etlworks CDC has read all entries
-
Monitor offset logs to detect if resync is needed
-
Use a dedicated user with read-only access for CDC
-
Increase oplog or journal receiver size to support longer downtimes
AS/400 Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- initial: for tables in capture mode, the connector takes a snapshot of the schema for the table and the data in the table.
- schema_only: for tables in capture mode, the connector takes a snapshot of only the schema for the table. This is useful when only the changes that are happening from now on need to be emitted. After the snapshot is completed, the connector continues by reading change events from the database’s redo logs.
- never: never perform as snapshot and start steaming immediately ion startup.
Handling Date and Time Columns
IBM i supports very old date values (for example, 0001-01-01). In certain CDC scenarios, the default AS400 JDBC driver may incorrectly decode dates and timestamps that fall outside standard ranges.
To ensure proper handling of ancient dates and microsecond precision:
-
Set time.precision.mode=adaptive_time_microseconds under Other Parameters.
-
Set alternative.temporal.decoder=true under Other Parameters.
-
Configure the following formats under Formats and Templates:
-
Date and Time Format: yyyy-MM-dd HH:mm:ss.SSSSSS
-
Date Format: yyyy-MM-dd
-
Time Format: HH:mm:ss.SSSSSS
-
Timezone: UTC
-
Setting the timezone to UTC prevents timezone drift caused by the JVM timezone where Etlworks is running.
When alternative.temporal.decoder is enabled, the connector uses internal date/time decoding instead of the AS400 JDBC driver. This is recommended if your database stores very old dates or if you require strict microsecond precision.
Capturing Journal Metadata
Etlworks can capture additional metadata from the AS400 journal along with the actual table row changes. Instead of adding hardcoded columns in the connector, this functionality is implemented using Extra Columns.
The following metadata attributes are available:
-
offset.event_sequence
The journal sequence number of the entry. This uniquely identifies the entry within a journal receiver and is useful for deduplication.
-
event.timestamp
The timestamp recorded by the journal for the change event.
-
journal.code
Journal entry code that identifies the general type of operation.
-
entry.type
Journal entry type (for example, insert, update, delete).
-
offset.receiver
The journal receiver name where the entry was recorded.
You can expose any of these attributes as columns in the destination table using the Extra Columns setting.
Example:
as400_event_sequence=long(offset.event_sequence),
as400_event_timestamp=timestamp(event.timestamp),
as400_journal_code=journal.code,
as400_entry_type=entry.type,
as400_offset_receiver=offset.receiver
In this example:
-
as400_event_sequence stores the numeric journal sequence.
-
as400_event_timestamp stores the journal timestamp as a timestamp column.
-
as400_journal_code, as400_entry_type, and as400_offset_receiver expose additional journal metadata.
You can remove any of these attributes and/or rename the target columns (left side of =).
If the destination table already exists, the new metadata columns will be added automatically on the next run as long as: “Alter target table if the source has columns that the target table doesn’t have” is enabled in the transformation.
In snapshot mode, most journal metadata fields will be empty or null because snapshot processing does not read journal entries.
Handling Deduplication
CDC guarantees “at least once” delivery. In certain scenarios, journal entries may be reprocessed.
To enable built-in deduplication add the following property under Other Parameters:
dedup.by.sequence=true
This enables deduplication by:
-
sequence number
-
receiver
-
receiver library
IMPORTANT: Built-in deduplication works only while the flow is running. It prevents processing journal entries with the same sequence number + receiver + receiver library more than once during execution.
Optional:
dedup.cache.size=number
Default is 1000. This is sufficient for most workloads.
MongoDB
Configuring MongoDB for CDC
Etlworks can stream real-time changes from MongoDB using native change streams. This works for insert, update, and delete operations on collections.
IMPORTANT: MongoDB must be configured as a replica set, even in single-node deployments. Standalone MongoDB servers are not supported for change data capture.
Prerequisites
-
MongoDB must run as a replica set or sharded cluster
CDC requires change streams, which are only supported on:
-
Replica sets
-
Sharded clusters (each shard must be a replica set)
-
-
Minimum MongoDB version: 3.6
Change streams were introduced in MongoDB 3.6. For pre-image support (capturing before values), MongoDB 6.0 or higher is required.
-
User privileges
The MongoDB user must have the following roles or privileges:
-
read on the target database
-
read on admin and config databases
-
listDatabases
-
find and changeStream at the cluster level (not needed if Capture scope = database)
(Use roles: [ { role: "readAnyDatabase", db: "admin" }, { role: "clusterMonitor", db: "admin" } ] as a base)
-
-
MongoDB Atlas
Etlworks supports CDC with MongoDB Atlas. Ensure your connection string uses SSL (mongodb+srv://) and the connector option Use SSL is enabled in Etlworks.
Step 1: Enable Replica Set (Local MongoDB)
If you’re running MongoDB locally or self-managed in a VM or container, configure it as a replica set:
1. Add replication section to mongod.conf (yaml):
replication:
replSetName: rs0
2. Restart the MongoDB server.
3. Initialize the replica set (js):
rs.initiate()
4. Verify status (js):
rs.status()
Step 2: Enable Pre-Images (Optional but Recommended)
To capture before values in updates and deletes (shown as before in Etlworks records), enable pre-image tracking per collection.
For each collection:
db.runCommand({
collMod: "your_collection_name",
changeStreamPreAndPostImages: { enabled: true }
})
Or at creation time:
db.createCollection("your_collection_name", {
changeStreamPreAndPostImages: { enabled: true }
})
Step 3: Optimize Oplog Retention (Highly Recommended)
Change streams rely on the oplog, a fixed-size log of recent operations. If Etlworks is offline for a period and the required oplog entries are purged, CDC cannot resume.
You can prevent data loss by:
Option A: Increase Oplog Size (all versions)
Use mongod startup option or reconfigure via admin command:
mongod --replSet rs0 --oplogSize 6144 # Size in MB (6GB example)
Option B: Enable Time-Based Oplog Retention (MongoDB 4.4+)
MongoDB 4.4+ supports minimum time-based oplog retention. Example:
mongod --replSet rs0 --oplogMinRetentionHours 12
This guarantees that oplog entries are retained for at least n hours, even if the maximum size is reached.
Sizing Example
If your database generates ~3GB of oplog per hour and Etlworks may be offline for up to 2 hours, set:
-
Oplog size = 3 GB/hour × 2 hours = 6 GB
-
Or use --oplogMinRetentionHours 2 (preferred)
MongoDB Snapshot Modes
Snapshot mode defines how and when the connector captures the current state of the source database.
- ad-hoc schema only: for tables in capture mode, the connector takes a snapshot of only the schema for the table. After the schema snapshot is completed, the connector continues by reading change events from the database’s redo logs. This mode allows adding new tables and triggering incremental snapshots at runtime.
- initial - the connector reads a snapshot when no offset is found. Uppon completion of the snapshot the connector begins to stream change events.
- ad-hoc initial - same as 'initial' but this mode allows adding new collections and triggering ad-hoc snapshot of new collections on startup and/or on-demand;
- never - the connector never snapshots the data and on startup immediately begins to stream change events;
- ad-hoc schema only - sames as 'never' but this mode allows adding new collections and triggering ad-hoc snapshot of new collections on startup and/or on-demand;
- initial_only: the connector runs a snapshot only when no offsets have been recorded for the logical server name and then stops, i.e. it will not read change events from the oplog.
Adding New Collections to the CDC Pipeline
-
Set snapshot mode to:
-
ad-hoc initial
-
ad-hoc
-
-
Enable Automatically trigger ad-hoc snapshot for new collections.
-
Stop the Flow.
-
Add collection names (not regex) to the FROM field.
-
Restart the Flow.
Supported Capture Modes
Etlworks supports the following MongoDB capture modes:
-
change_streams_update_full: change streams with full update payload.
-
change_streams: change streams without full update payload.
-
oplog: tailing the oplog (deprecated).
Serialize CDC Events as JSON
To preserve the original MongoDB document structure:
-
In the CDC connection:
-
Disable Serialize CDC events as CSV files
-
Enable Serialize CDC events as JSON files
-
Disable Close CSV file if header has changed
-
-
Enable all 3 options for handling JSON columns:
-
Flatten extended JSON (e.g., timestamps, ObjectIds)
-
Preserve nested fields
-
Output as valid JSON array
-
This configuration ensures compatibility with JSON destinations like Snowflake and supports storage in cloud or data lake environments.
Parallel Snapshot is a Beta Feature on MongoDB
The CDC connection has a Snapshot Max Threads field that controls how many tables the connector snapshots concurrently during the initial load phase. On relational sources, raising this value above 1 can cut snapshot time. On MongoDB sources, parallel snapshot is a beta feature and is not recommended for production use.
The underlying Debezium MongoDB connector's parallel snapshot path is known to corrupt the snapshot or fail with hard-to-diagnose errors under load. We have seen this both during the initial snapshot and during ad-hoc re-snapshots of existing collections.
What to do. Keep Snapshot Max Threads at 1 on every MongoDB CDC connection. The Etlworks Flow Findings inspector flags Snapshot Max Threads > 1 on a Mongo CDC source as a Major structural issue. See also CDC settings reference → Snapshot Max Threads.