Overview

A snapshot is how a CDC connector loads the current state of one or more source tables (or MongoDB collections) into the destination, so that subsequent change events can be applied on top of a complete baseline. Without a snapshot, the destination is missing every row that existed before the connector started; with a snapshot, the destination catches up to the source and then stays in sync via streaming.

Etlworks supports three snapshot types and two orthogonal modifiers that can be applied to one or more of the types. This article walks through each of them and explains when to use it, what changes when you do, and where the tradeoffs are.

Types vs. modifiers

The distinction matters because the two categories answer different questions:

A snapshot type answers when the snapshot runs and what triggers it. The three types are Full (Initial) Snapshot, Ad-hoc Snapshot, and Incremental Snapshot. Every CDC flow runs at least one snapshot type at some point, and the choice determines the flow's lifecycle.
A snapshot modifier answers how the snapshot is executed once it starts. The two modifiers are Parallel Snapshot and Partial Snapshot. Modifiers do not replace a type; they change how a Full or Ad-hoc snapshot copies data. They do not apply to Incremental Snapshot, which has its own built-in chunking mechanism.

A concrete example: a MySQL pipeline that bootstraps by copying the last 90 days of the orders table across four threads is using the Full (Initial) type with both the Partial modifier (only recent rows) and the Parallel modifier (four threads).
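
The combination above might be expressed in connector properties roughly as follows. This is a sketch only: snapshot.select.statement.overrides is the property named in this article, and snapshot.max.threads is assumed to be the property behind Snapshot Max Threads. The shop.orders table, the created_at column, and the helper function are hypothetical, and the exact property names or prefix that Etlworks expects may differ.

```python
# Sketch: build Debezium-style properties for a Full (Initial) snapshot that is
# both Partial (last 90 days only) and Parallel (four threads).
# Table name, column name, and property spelling are illustrative assumptions.

def partial_parallel_snapshot_props(table: str, date_column: str,
                                    days: int, threads: int) -> dict[str, str]:
    where = f"{date_column} >= NOW() - INTERVAL {days} DAY"  # MySQL date syntax
    return {
        "snapshot.mode": "initial",
        "snapshot.max.threads": str(threads),
        # Lists the tables whose snapshot SELECT is overridden.
        "snapshot.select.statement.overrides": table,
        # The per-table override carries the WHERE clause.
        f"snapshot.select.statement.overrides.{table}":
            f"SELECT * FROM {table} WHERE {where}",
    }

props = partial_parallel_snapshot_props("shop.orders", "created_at", 90, 4)
for key, value in props.items():
    print(f"{key}={value}")
```

Only the snapshot is filtered by the WHERE clause; changes streamed afterwards are not restricted by it.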

Choosing a snapshot approach

Match your goal to the entry in this table, then read the referenced section for tradeoffs.

Your goal	Snapshot type	Modifier(s)
Bootstrap a brand-new pipeline — load the current state of every monitored table, then keep it in sync	Full (Initial)	Optionally Parallel if the source has capacity (not on MongoDB); optionally Partial if you only need a subset
Add a new table to an already-running pipeline, once you have stopped it briefly	Ad-hoc, at restart	Optionally Parallel if the source has capacity (not on MongoDB); optionally Partial if you only need a subset
Add a new table without stopping the flow	Incremental	None
Reload an existing table without restarting the flow	Ad-hoc via signal	Optionally Parallel if the source has capacity (not on MongoDB); optionally Partial if you only need a subset
Load only recent rows from a very large table (for example, the last 90 days), then stream all subsequent changes	Full, ad-hoc	Partial (WHERE clause via snapshot.select.statement.overrides)
Do not snapshot; start streaming from the current source position	None — set Snapshot Mode to never	—
Capture schema but no data (for example, the destination will be seeded separately)	None — set Snapshot Mode to schema_only	—

The exact Snapshot Mode values that drive each type are covered in CDC settings reference → Snapshot mode.

Snapshot Types

Full (Initial) Snapshot

What it does. On the first startup of the CDC flow (and on any subsequent startup where the connector cannot find a usable offset), the connector opens a consistent read of every monitored table, streams every row into the destination, and then transitions to streaming mode — from that point forward it forwards change events from the source's log.

Where to configure. CDC connection, Snapshot Mode field. Values that produce a full snapshot include initial (snapshot then stream), ad-hoc initial (snapshot then stream, and enable ad-hoc snapshots for later), initial_only (snapshot and stop), and when_needed (snapshot only if no valid offset).

When to use. Every new pipeline where the destination needs a complete copy of the source's current state. This is the default and the right choice for the vast majority of setups.

Tradeoffs.

Snapshot duration scales with data volume. On a large source, the initial snapshot can take hours or days. During that time the destination is being populated but has not yet seen recent source changes; the connector switches to streaming as soon as the snapshot completes.
The snapshot is resumable by default (9.6.9+). Progress is persisted as each table completes, and if the flow is stopped or interrupted mid-snapshot, the next start skips the tables already snapshotted and continues with the rest — see Resumable full snapshot below. In releases before 9.6.9, or when the option is explicitly disabled, the full snapshot restarts from the beginning after any interruption.
The source's log must retain enough history to cover the snapshot. While the snapshot is running, the source's binlog / WAL / archive log / oplog keeps growing. If retention is too short and log positions from before the snapshot completes get purged, the flow cannot transition to streaming cleanly and will report a "position not found" error on restart. Configure log retention with a margin that comfortably exceeds the expected snapshot duration.
Locking and load. On most sources the snapshot opens a long-running consistent read, which can hold locks or delay VACUUM/log truncation. The database-specific configuration guides in Database-Specific CDC Cases cover the mitigations per engine.

Applicable modifiers. Parallel (except on MongoDB) and Partial can both be applied.

What can trigger full automatic re-snapshot of all tables

Connector	Case	Common cause	Snapshot type / result	Notes
All CDC connectors	Offset file is missing	File deleted, file not restored from backup, app data folder changed, tenant folder changed, connection renamed when file name is auto-generated	initial, ad-hoc initial, when_needed, or initial_only	The connector treats this as no previous state. If the mode includes a data snapshot, it snapshots again.
All CDC connectors	Offset file is empty	Zero-byte file	initial, ad-hoc initial, when_needed, or initial_only	Empty offset is treated like no offset
All CDC connectors	Offset file exists but cannot be read	File permission issue, wrong owner, locked file, unreadable disk path	No automatic snapshot. Startup fails.	Important: unreadable is not the same as missing. The connector does not fall back to snapshot.
All CDC connectors	Offset file exists but is corrupt	Edited manually, invalid file content, partial binary/object data, copied from wrong connection	No automatic snapshot. Startup fails.	Support should restore a valid backup or intentionally remove/recreate state after confirming impact.
All CDC connectors	Connection was renamed and offset/history names are auto-generated	Connection Name changed in UI, cloned connection renamed	initial, ad-hoc initial, when_needed, or initial_only	Explicit offset/history file names prevent this class of accidental re-snapshot.
All CDC connectors	App data folder changed	Self-hosted migration, container volume changed, tenant data folder mismatch	initial, ad-hoc initial, when_needed, or initial_only	Connector cannot find previous state files, so it behaves like first run.
MySQL	Saved binlog position is no longer available	Binlogs purged, RDS retention too short, long downtime	when_needed or ad-hoc initial	These modes can automatically snapshot again when the previous binlog/GTID position cannot be used. Plain initial normally fails instead.
PostgreSQL	Snapshot Mode is always	Config explicitly set to always snapshot on startup	always	Snapshot runs on every restart by design.
Oracle	Snapshot Mode is always, if configured	Config explicitly set to always snapshot on startup	always	Snapshot runs on every restart by design.
MongoDB	Saved resume token is rejected during startup validation	Oplog/change stream history no longer contains the resume point, token is invalid, token belongs to incompatible stream state	initial or ad-hoc initial snapshot for affected replica set / monitored collections	Upon finishing the snapshot, the connector can decide that the saved token is not usable and snapshot again. This is an automatic re-snapshot risk. If the token problem is detected later, during streaming, the flow fails instead.
Relational connectors	Offset exists but history file is missing	History file deleted, backup missed it, wrong generated history file name	No automatic data snapshot. Startup fails or requires schema recovery.	This does not automatically trigger initial again because the offset still says previous state exists.
Relational connectors	History file exists but cannot be read or is corrupt	Permission issue, bad manual edit, partial restore	No automatic data snapshot. Startup fails or requires schema recovery.	For MySQL, schema recovery modes can rebuild schema history, but that is not a data re-snapshot.
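
The offset-file rows of the table can be summarized as a small decision function. This is an illustrative sketch of the documented outcomes, not connector code; the state names and the function are hypothetical, and the log-position, resume-token, and history-file rows are deliberately left out.

```python
# Sketch of the offset-file rows in the table above.
# Mode names mirror the article; everything else is illustrative.

SNAPSHOT_MODES_WITH_DATA = {"initial", "ad-hoc initial", "when_needed", "initial_only"}

def startup_outcome(offset_state: str, snapshot_mode: str) -> str:
    """offset_state is one of: 'missing', 'empty', 'unreadable', 'corrupt'."""
    if offset_state in ("unreadable", "corrupt"):
        # Unreadable is not the same as missing: there is no fallback to snapshot.
        return "startup fails"
    if offset_state in ("missing", "empty"):
        # Both are treated as "no previous state".
        if snapshot_mode in SNAPSHOT_MODES_WITH_DATA:
            return "full re-snapshot"
        return "no data snapshot"
    raise ValueError(f"unknown offset state: {offset_state}")

print(startup_outcome("missing", "initial"))
print(startup_outcome("corrupt", "initial"))
```

The asymmetry is the point: a missing or empty file silently behaves like a first run, while an unreadable or corrupt file stops the flow so that an operator can decide what to do.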

Resumable full snapshot (default, 9.6.9+)

Starting in Etlworks 9.6.9, the full (initial) snapshot is resumable at the table level: if the snapshot is interrupted before it completes, the next start skips the tables that were already snapshotted and continues with the ones that were not, then transitions to streaming from the source log position recorded at the very start of the original snapshot.

Resumability is enabled by default on every CDC connection. The related settings are:

UI label — Resume interrupted full snapshot on the CDC connection.
Connector property — debezium.resumable.full.snapshot.enabled. Default value: true.

Behavior in detail. When the connector begins a full snapshot, it records the current source log position — the binlog file and position on MySQL, the LSN on PostgreSQL and SQL Server, the SCN on Oracle, the change position on DB2 and AS/400, the resume token or oplog timestamp on MongoDB. As each table finishes, its completion is persisted to the snapshot progress file. If the snapshot is interrupted for any reason — the flow stops, the integrator restarts, the connection is lost — the persisted progress survives.

On restart the connector begins a full snapshot again, checks the progress file, skips the tables already recorded as complete, and snapshots the remaining tables. After the resumed snapshot finishes, streaming picks up from the log position recorded at the start of the original snapshot — not the current head of the log — which preserves CDC correctness by ensuring no change events between the original snapshot start and the resumed streaming are lost.

Example. A CDC flow monitors five tables. The full snapshot completes tables 1–4 and fails on the fifth. On restart, Etlworks skips tables 1–4, snapshots table 5, and then resumes streaming from the log position recorded at the start of the original snapshot. In releases before 9.6.9, or with resumability disabled, all five tables would have been snapshotted from scratch.
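
The five-table example can be modelled in a few lines. This is a toy simulation under stated assumptions, not the connector's implementation: the progress file is an in-memory dict, log positions are plain integers, and the function and key names are hypothetical.

```python
# Illustrative simulation of table-level resume.

def run_full_snapshot(tables, progress, current_log_position, fail_on=None):
    """Snapshot tables not yet recorded as complete; return the streaming start position."""
    # The log position is recorded once, at the start of the original snapshot.
    progress.setdefault("start_position", current_log_position)
    done = progress.setdefault("completed", [])
    for table in tables:
        if table in done:
            continue  # already snapshotted in an earlier attempt
        if table == fail_on:
            raise RuntimeError(f"snapshot of {table} interrupted")
        done.append(table)  # persisted as each table completes
    return progress["start_position"]

tables = ["t1", "t2", "t3", "t4", "t5"]
progress = {}

try:
    run_full_snapshot(tables, progress, current_log_position=1000, fail_on="t5")
except RuntimeError:
    pass  # the first attempt fails on the fifth table

completed_before_restart = list(progress["completed"])

# Restart: the log head has moved on, but streaming starts from the original position.
stream_from = run_full_snapshot(tables, progress, current_log_position=1500)
print(completed_before_restart, stream_from)
```

Streaming from the position recorded at the first attempt, rather than from the log head at restart, is what keeps changes made to tables 1-4 during the outage from being lost.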

Why this is useful. Large-table snapshots can take hours. Losing them to a transient failure — a network blip, a host restart, an out-of-memory condition — previously meant starting the entire snapshot over. Resumable full snapshots eliminate that cost, reduce recovery time, and reduce load on both the source database and the destination.

Correctness note: the original log position must still be available. If the source database has purged its log past the position recorded when the original snapshot began, the connector cannot resume streaming from there without losing events. In that case Etlworks logs the condition clearly and, depending on the connector's configured failure behavior, either starts a fresh full snapshot or fails the flow. Configure source log retention with a margin that comfortably exceeds the expected total snapshot duration, including retries.

Edge case: very fast parallel snapshots. When parallel snapshot is enabled and individual tables finish very quickly, it is possible for the flow to fail before the completion of those fast tables has been persisted to the progress file. Restart then re-snapshots more tables than expected, because only persisted completions are honored. This is most visible with small tables and high Snapshot Max Threads values. If you observe it, either lower Snapshot Max Threads or accept the additional re-snapshot work.

Disabling resumable full snapshots. On rare occasions — for example, when reproducing a support case, or when the progress file itself is suspect — you may want the pre-9.6.9 behavior of always restarting the snapshot from scratch. Clear the Resume interrupted full snapshot checkbox on the connection, or set debezium.resumable.full.snapshot.enabled to false in Other Parameters. In that mode any interruption discards partial progress.

Ad-hoc Snapshot

What it does. Once the initial snapshot has completed and the connector is streaming, an ad-hoc snapshot re-runs the snapshot phase for one or more specific tables without resetting the whole pipeline. The connector pauses streaming briefly, snapshots the requested tables, then resumes streaming from where it left off.

Where to configure. CDC connection, Snapshot Mode field. Any value with the ad-hoc prefix (for example ad-hoc initial or ad-hoc schema_only) enables ad-hoc snapshots. Without an ad-hoc mode set, this feature is disabled.

When to use.

To add a new table to the pipeline (see Add New Tables below).
To reload an existing table without a full pipeline reset (see Reload Existing Tables below).
To recover a specific table whose downstream state has drifted from the source.

Tradeoffs.

Ad-hoc snapshots are resumable at the table level (9.6.8+). If a blocking ad-hoc snapshot fails after some of the requested tables have finished, restarting the flow retries only the remaining tables — see Resumable ad-hoc snapshot below. Within a single table, however, the snapshot restarts from the beginning of that table. For very large individual tables where mid-table resumability matters, prefer Incremental Snapshot.
Streaming is paused during the ad-hoc snapshot. New change events on the tables under snapshot accumulate in the source's log and are replayed once the snapshot completes. On busy tables the pause introduces measurable latency.
The FROM field must be an explicit list of tables. Ad-hoc snapshots operate on specific table names. When FROM is a wildcard, an empty value, or a regex, the mechanism cannot resolve which table to target — see Wildcard or regex FROM patterns disable ad-hoc snapshots below.

Applicable modifiers. Parallel (except on MongoDB) can be applied to speed up ad-hoc snapshots that cover multiple tables. Partial is not typically applied to ad-hoc snapshots.

Wildcard or regex FROM patterns disable ad-hoc snapshots

Ad-hoc snapshots operate on an explicit list of tables. The mechanism that lets you add a new table or reload an existing one requires the connector to know in advance which tables are in scope. When the CDC flow's FROM field is empty, set to the wildcard *, or contains a regular expression, the connector cannot resolve ad-hoc requests to specific tables and the feature is effectively disabled.

A wildcard or regex FROM also has a second, less obvious failure mode. The set of tables the pattern matches is recomputed every time the connector starts — on every flow restart, on every node failover, and during disaster recovery. If a new table that matches the pattern was created in the source database since the last start, the connector may decide to perform a full re-snapshot on the new members, even though, from the operator's point of view, nothing was changed. The same happens when a table is renamed in a way that changes whether it matches.

The fix is to keep FROM explicit. Configure FROM as a comma-separated list of fully qualified table names, for example schema.orders, schema.customers, schema.line_items. When you need to add or remove a table, edit the FROM list and use ad-hoc snapshot (below). Etlworks' Flow Findings inspector flags any non-explicit FROM on a CDC source as a Major structural issue. See also CDC settings reference → FROM — list of tables to capture.
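
A pre-deployment check for a non-explicit FROM could look like the sketch below. It is only an approximation of the rule described above: the regular expression, the accepted name shapes (schema.table or db.schema.table), and the function name are assumptions, and the real FROM parser may accept other forms such as quoted identifiers.

```python
import re

# A name counts as explicit here if it has two or three dot-separated parts
# made of plain identifier characters.
_QUALIFIED_NAME = re.compile(r"^[A-Za-z_][\w$]*(\.[A-Za-z_][\w$]*){1,2}$")

def non_explicit_entries(from_value: str) -> list[str]:
    """Return the FROM entries that are not plain fully qualified table names."""
    if not from_value.strip():
        return ["<empty>"]
    entries = [entry.strip() for entry in from_value.split(",")]
    return [entry for entry in entries if not _QUALIFIED_NAME.match(entry)]

print(non_explicit_entries("schema.orders, schema.customers, schema.line_items"))
print(non_explicit_entries("schema.orders, schema.ord.*"))
```

An empty result means every entry is an explicit table name; anything returned is a candidate for the Flow Findings warning.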

Add New Tables

To start capturing changes from a new table using an ad-hoc snapshot at restart:

CDC connection:

Ensure Snapshot Mode is set to one of the ad-hoc variants: ad-hoc initial, ad-hoc, or ad-hoc schema only.
Enable Automatically trigger ad-hoc snapshot for new tables.

CDC flow:

Stop the flow.
Add the new tables to the FROM field. Use fully qualified table names, comma-separated. Regular expressions are not supported for this workflow.
Restart the flow.

On restart the connector runs an ad-hoc snapshot of the newly added tables, then resumes streaming with the expanded capture set.

Reload Existing Tables

To re-snapshot specific tables at runtime, without stopping or restarting the flow, use a signal collection. This is an ad-hoc snapshot triggered through a signal table or file rather than through a restart.

CDC connection:

Set Snapshot Mode to one of the ad-hoc variants (ad-hoc initial, ad-hoc, or ad-hoc schema only).
Set Signal Data Collection to either the fully qualified name of a signal table (when using a database signal source) or a file name with extension (when using a file-based signal source).

CDC flow:

Include the signal connection (database or file) in the flow's Connections tab, named Signal.
Signal entries must use fully qualified table names; regular expressions are not supported.

How it works. The signal collection holds table names to re-snapshot. When the connector detects an entry, it performs the snapshot for that table and then removes the signal (deleting the row or the file). The flow does not need to be restarted.

Signal format. One column, one or more fully qualified table names, comma-separated. Example:

test.inventory, test.payment, test.customer

When to use signal-based reload vs. incremental snapshot. Signal-based ad-hoc is simpler to set up and does not require write access to the source database. If the tables you are reloading are small enough that the pause in streaming is acceptable, this is the right choice. For very large tables where a long pause is unacceptable, use Incremental Snapshot instead.

Resumable ad-hoc snapshot (9.6.8+)

Starting in Etlworks 9.6.8, ad-hoc snapshots are resumable at the table level: if a blocking ad-hoc snapshot fails after finishing some of the requested tables, restarting the flow retries the ad-hoc snapshot for the tables that were not completed, rather than falling back to a full re-snapshot of every monitored table.

There is no configuration to enable this behavior; it applies automatically to any blocking ad-hoc snapshot triggered through either the restart path (Add New Tables) or the signal path (Reload Existing Tables).

Example. A CDC flow monitors five tables. Two new tables are added to the FROM field and the flow is restarted, so Etlworks kicks off an ad-hoc snapshot for the two new tables. The snapshot completes the first new table and fails on the second. In releases before 9.6.8, the next start would have triggered a full snapshot of all five monitored tables. In 9.6.8 and later, the next start retries the ad-hoc snapshot for the remaining new table only, then resumes streaming.

Why this is useful. A single failed ad-hoc snapshot no longer wipes out the progress of the rest of the pipeline. Recovery time drops, source database load stays proportional to the actual work left to do, and the intended scope of the ad-hoc snapshot is preserved.

Fully clearing an incomplete ad-hoc snapshot. If you want to clear an incomplete ad-hoc snapshot instead of letting it resume — for example, if the reason it failed is no longer relevant — use Reset CDC State in the CDC offset editor. It removes the ad-hoc / resumable snapshot sidecar state files along with the offset and history files, after writing a backup zip you can restore from if needed. See Reset CDC State.

Operational note: signal-triggered ad-hoc snapshots. If the ad-hoc snapshot was originally triggered by a row in a database signal collection (see Reload Existing Tables), the recovery retry may be re-triggered internally using a file-based signal instead. This is expected behavior and does not require any action from the operator; the important outcome is that the incomplete blocking snapshot resumes without forcing a full snapshot.

Incremental Snapshot

What it does. An incremental snapshot runs alongside streaming rather than pausing it. The connector reads the source table in bounded chunks, and between chunks it continues to consume change events from the log. The result is a resumable snapshot that keeps streaming latency roughly constant regardless of the snapshot's total duration.
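
The interleaving of chunks and change events can be pictured with a toy generator. This is a simplified model, not the connector's algorithm: it ignores the reconciliation between chunk rows and concurrent change events that a real connector has to perform, and the state dict stands in for the persisted chunk position. All names are hypothetical.

```python
# Toy model: chunks and log events are interleaved, and the position after the
# last completed chunk is remembered so a restart continues from the next one.

def incremental_snapshot(rows, chunk_size, log_events, state):
    """Yield ('chunk', rows) and ('event', e) items, resuming from state['next_offset']."""
    offset = state.get("next_offset", 0)
    events = iter(log_events)
    while offset < len(rows):
        yield ("chunk", rows[offset:offset + chunk_size])
        offset += chunk_size
        state["next_offset"] = offset   # progress survives an interruption
        event = next(events, None)      # streaming continues between chunks
        if event is not None:
            yield ("event", event)

rows = list(range(1, 8))
state = {}

# First run: one chunk and one event are processed, then the flow is interrupted.
run = incremental_snapshot(rows, 3, ["e1", "e2", "e3"], state)
first = [next(run), next(run)]

# Restart: the snapshot continues with the next chunk instead of starting over.
resumed = list(incremental_snapshot(rows, 3, ["e2", "e3"], state))
print(first)
print(resumed)
```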

Where to configure. CDC connection, Signal Data Collection field (pointing at a signal table in the source database). Do not set Snapshot Mode to any of the ad-hoc variants when using incremental snapshot — incremental snapshot activates when Signal Data Collection is set on a non-ad-hoc mode.

When to use.

Very large tables where a Full or Ad-hoc snapshot would take long enough that streaming latency during the pause becomes a problem.
Adding a new table to a running pipeline without any pause in streaming.
Pipelines where uninterrupted streaming is a hard requirement for downstream consumers.

Tradeoffs.

Resumable at the chunk level. Full and Ad-hoc snapshots resume only at table boundaries, whereas an incremental snapshot survives interruption within a table: the connector remembers which chunks it has completed and picks up from the next one on restart.
Modifiers do not apply. Neither Parallel nor Partial can be used with an incremental snapshot — it has its own internal chunking, and the WHERE-clause mechanism used by Partial is not supported by the incremental path.
See Production risks to understand before enabling below for the real cost of choosing this type.

Not the same as chunked parallel snapshot. Both an incremental snapshot and a chunked parallel snapshot read a table in ranges, but they solve different problems: an incremental snapshot runs alongside streaming and is resumable chunk by chunk, while a chunked parallel snapshot speeds up a full (paused) snapshot and is not resumable at the chunk level. Incremental snapshots have their own chunking mechanism, sized by Incremental Snapshot Chunk Size — the parallel snapshot settings do not affect them.

Production risks to understand before enabling

Incremental snapshots are powerful, but they have two production-level costs that the convenience can obscure.

Read/write access to the source database is required. The connector writes the signal markers to the source itself, in the signal table you create. On a regulated or audited source database, that may require a change request to grant write privileges to the CDC user.
On PostgreSQL specifically, an incremental snapshot can cause WAL segments to accumulate for the duration of the snapshot. PostgreSQL retains every WAL segment that any replication slot still references, and the connector's replication slot holds back the WAL position throughout the incremental snapshot. On busy production sources we have seen the WAL volume grow by many gigabytes per hour and run out of disk before the snapshot completes. Plan for the increased disk usage, monitor WAL retention closely, and consider running the incremental snapshot in a maintenance window if the source is high-traffic.

If neither of these tradeoffs is acceptable, use ad-hoc snapshots with an explicit Signal connection instead — see the Reload Existing Tables subsection above. Ad-hoc snapshots do not require write access to the source and do not have the WAL retention effect.

Setting up an incremental snapshot

Verify permissions. The CDC connection must use a database user with write access to the source database (needed to write signal markers).

Create the signal table in the source database:

```
CREATE TABLE db.schema.dbz_signal (
  id VARCHAR(64),
  type VARCHAR(32),
  data VARCHAR(2048)
);
```

Configure the CDC connection:
- Set Signal Data Collection to the fully qualified name of the signal table.
- Do not set Snapshot Mode to any ad-hoc variant.
- Add the signal table to the list of included tables (in the CDC flow or CDC connection).

Trigger a snapshot by inserting a row into the signal table:
```
INSERT INTO db.schema.dbz_signal (id, type, data)
VALUES ('signal-1', 'execute-snapshot',
        '{"data-collections": ["inventory.dbo.orders", "inventory.dbo.item"]}');
```
The snapshot runs immediately if the flow is active, or on the next restart if it is not.
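
When signals are issued from a script, building the data column with a JSON library avoids quoting mistakes. The sketch below only prepares the row values; the helper name and the generated id format are assumptions, and the commented-out execute call uses a placeholder style that depends on your database driver.

```python
import json
import uuid

def execute_snapshot_signal(tables: list[str]) -> tuple[str, str, str]:
    """Return (id, type, data) for a signal row requesting a snapshot of the given tables."""
    if not tables:
        raise ValueError("at least one table is required")
    data = json.dumps({"data-collections": tables})
    # "signal-" plus a UUID is 43 characters, which fits the VARCHAR(64) id column.
    return (f"signal-{uuid.uuid4()}", "execute-snapshot", data)

row = execute_snapshot_signal(["inventory.dbo.orders", "inventory.dbo.item"])
# Pass the tuple as bind parameters through your driver, for example:
#   cursor.execute(
#       "INSERT INTO db.schema.dbz_signal (id, type, data) VALUES (?, ?, ?)", row)
print(row)
```

A unique id per signal makes it easier to tell individual requests apart in the signal table.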

Snapshot Modifiers

Modifiers change how a snapshot executes without changing when it runs or what triggers it. Both modifiers apply to Full and (in Parallel's case) Ad-hoc snapshots; neither applies to Incremental Snapshot.

Parallel Snapshot

What it does. Splits the snapshot workload across multiple threads so that more than one unit of work runs concurrently instead of one at a time. Etlworks supports two forms of parallel snapshot:

Table-level parallel snapshot — multiple tables are read concurrently, one thread per table. This is the default form and is controlled by Snapshot Max Threads alone. It works on every CDC source except MongoDB, and is described in the rest of this section.
Chunked parallel snapshot — a single large table is divided into non-overlapping key ranges so that several snapshot threads can read that one table concurrently. It is opt-in, available only on a subset of CDC connectors, and covered in Chunked Parallel Snapshot below.

The tradeoffs and recommended thread values below apply to both forms. The MongoDB restriction concerns table-level parallel snapshot.

Where to configure. CDC connection, Snapshot Max Threads field. The default value is 1 (sequential); any value greater than 1 enables parallel snapshot for that connection.

Applies to. Full (Initial) Snapshot and Ad-hoc Snapshot. Does not apply to Incremental Snapshot.

When to use. Only when all of the following are true:

The pipeline captures many tables at once (a two-table snapshot does not benefit from four threads).
The source database has spare CPU, I/O, and connection capacity to absorb the added load.
The integrator host has enough free heap to hold N threads' worth of buffered rows simultaneously.
The source is not MongoDB (see the warning below).

Use with caution. Even when all four conditions above hold, parallel snapshot is not a free speed-up. The tradeoffs are real and add up quickly with high thread counts.

Tradeoffs

Memory consumption on the integrator scales with thread count. Each thread independently holds a JDBC result-set buffer sized by the connection's JDBC Fetch Size field (default 10,000 rows). On wide tables (many columns, or columns holding large text or binary values) that buffer can be sizeable, and running four or eight threads at once can push the JVM into GC pressure or an OutOfMemoryError. This is the single most common reason parallel snapshots fail in production. See also CDC settings reference → JDBC Fetch Size.
Source database load multiplies. Each thread opens its own connection and issues its own consistent read against the source. A busy production database may not have the connection headroom or the buffer-cache capacity to serve N concurrent snapshot readers without impacting the live workload.
Log-retention pressure increases. Each thread holds its own consistent-read snapshot for the duration of its scan, and the source cannot truncate the binlog / WAL / archive log until the last one is done. On PostgreSQL the replication slot holds back the WAL that much longer; on MySQL the binlog files grow that much larger; on Oracle the archive-log retention window must be that much wider.
Diagnosability suffers. When a parallel snapshot fails partway through, the failure may involve interactions between threads (contention, connection-pool exhaustion, memory pressure) that are harder to reproduce and reason about than a single-threaded snapshot.
The setting only affects the snapshot phase. Once the connector transitions to streaming, it uses a single capture thread regardless of this value. So parallelism only helps if the snapshot itself is the bottleneck.
Interaction with resumable full snapshot. When Resumable full snapshot is enabled (the default in 9.6.9+) and Snapshot Max Threads is set high, individual tables can finish so quickly that a failure occurs before their completion is persisted to the progress file. Restart then re-snapshots those tables even though they finished before the failure. If you see this in practice, lower Snapshot Max Threads or accept the extra re-snapshot work.
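
The memory tradeoff can be put into rough numbers. The estimate below multiplies thread count, JDBC Fetch Size, and an average row size; the 2 KiB row size is a made-up figure, and actual JVM heap usage can be considerably higher because of per-object overhead, so treat the result as a lower bound for comparison only.

```python
def snapshot_buffer_bytes(threads: int, fetch_size: int, avg_row_bytes: int) -> int:
    """Rough lower bound for rows buffered across all snapshot threads."""
    return threads * fetch_size * avg_row_bytes

# 10,000 is the default JDBC Fetch Size named above; 2 KiB per row is an assumption.
one_thread = snapshot_buffer_bytes(1, 10_000, 2048)
eight_threads = snapshot_buffer_bytes(8, 10_000, 2048)

print(f"1 thread:  {one_thread / 2**20:.0f} MiB")
print(f"8 threads: {eight_threads / 2**20:.0f} MiB")
```

Because the estimate is linear in both thread count and fetch size, lowering JDBC Fetch Size is one way to offset a higher Snapshot Max Threads value on wide tables.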

Never use parallel snapshot with a MongoDB source

On MongoDB CDC sources, parallel snapshot is a beta feature and is not safe for production. The underlying Debezium MongoDB connector's parallel snapshot path is known to corrupt the snapshot or fail with hard-to-diagnose errors under load. This is true both for initial snapshots and for ad-hoc re-snapshots of existing collections.

Keep Snapshot Max Threads at 1 on every MongoDB CDC connection. Etlworks' Flow Findings inspector flags Snapshot Max Threads > 1 on a MongoDB CDC source as a Major structural issue.

Recommended values (non-MongoDB sources)

When all preconditions are satisfied and MongoDB is not the source, start conservatively and confirm the source and integrator can absorb the load before raising further:

Situation	Suggested Snapshot Max Threads
Default, or any source where you are unsure	1 (sequential)
Non-MongoDB source, small tables (mostly under a million rows), source has capacity, few tables	1–2
Non-MongoDB source, many medium tables, source has confirmed spare capacity, integrator has ample heap	2–4
Non-MongoDB source, dedicated snapshot window, source and integrator both sized for it	Up to CPU cores available, tested case by case
MongoDB source (any scenario)	1 (never above)

Watch integrator heap usage and source database load during the first parallel snapshot run and adjust downward if either is stressed. See also CDC settings reference → Snapshot Max Threads.

Chunked Parallel Snapshot

What it does. Chunked parallel snapshot is the second form of parallel snapshot. Instead of assigning one thread per table, it divides a single large table into non-overlapping key ranges (chunks) and hands those chunks to the shared snapshot worker pool, so several threads can read the same table at once. Table-level parallelism still applies at the same time: multiple tables and multiple chunks share one pool.

Note — availability. Chunked parallel snapshots are available in Etlworks 9.7.1 or newer. For Flows running on an on-premise or customer-managed Integration Agent, the agent must also be upgraded to the version included with Etlworks 9.7.1 or newer. Updating only the Etlworks application does not enable this functionality on an older Integration Agent.

Supported connectors. Chunked parallel snapshots work on Oracle, Microsoft SQL Server, AS400/IBM i, MySQL, and PostgreSQL CDC sources. DB2 LUW and MongoDB are not supported — those connectors ignore the chunked path and keep their existing snapshot behavior. All new settings are disabled by default.

Applies to. Full (initial) snapshots and blocking ad-hoc snapshots. The ad-hoc snapshot Etlworks runs for newly added tables and the ad-hoc reload of existing tables both run as blocking ad-hoc snapshots, so they can use chunked parallel processing too. It does not apply to Debezium incremental snapshots, which have their own separate chunking mechanism controlled by Incremental Snapshot Chunk Size, not by the parallel snapshot settings.

Where to configure. CDC connection, in the Snapshot section. Three settings work together:

Snapshot Max Threads — existing setting. The maximum number of snapshot workers that execute concurrently. Default 1. This is the ceiling on real parallelism for both table-level and chunked snapshots.
Chunk large tables across snapshot threads — new checkbox that enables chunked parallel snapshots. Default disabled. Internal property snapshot.chunked.parallel.enabled.
Snapshot Thread Multiplier — new numeric setting. Default 1. Internal property snapshot.max.threads.multiplier. Controls how many chunks are created per snapshot worker. It does not create additional worker threads and does not open additional concurrent database connections beyond Snapshot Max Threads — Snapshot Max Threads alone controls worker concurrency.

Important: enabling Chunk large tables across snapshot threads while leaving Snapshot Max Threads at 1 normally provides no parallel performance benefit — a single worker still processes one chunk at a time. Raise Snapshot Max Threads above 1 for chunking to help.

How the multiplier works. With Snapshot Max Threads = 4 and Snapshot Thread Multiplier = 1, Etlworks creates approximately 4 chunks for each eligible table. With the multiplier set to 2, it creates approximately 8 chunks, but only 4 still execute concurrently — the extra chunks queue. Queuing more, smaller chunks can improve worker utilization when chunks take uneven amounts of time to read, because a worker that finishes early picks up a queued chunk instead of sitting idle. Start with multiplier 1 and increase it only after testing shows uneven chunk execution or idle snapshot workers.
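
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the relationship between the two settings, not Etlworks code: the function name plan_chunks and its return shape are hypothetical, and the real chunk count is approximate and also depends on table size and eligibility.

```python
def plan_chunks(max_threads: int, multiplier: int = 1) -> dict:
    """Approximate chunk plan for one eligible table (illustrative only)."""
    if max_threads < 1 or multiplier < 1:
        raise ValueError("both settings must be at least 1")
    # Approximate number of chunks created for the table.
    chunks = max_threads * multiplier
    # Snapshot Max Threads alone caps how many chunks run at once.
    concurrent = min(chunks, max_threads)
    return {"chunks": chunks, "concurrent": concurrent, "queued": chunks - concurrent}


print(plan_chunks(4, 1))  # {'chunks': 4, 'concurrent': 4, 'queued': 0}
print(plan_chunks(4, 2))  # {'chunks': 8, 'concurrent': 4, 'queued': 4}
```

Raising the multiplier changes only the queued figure; the concurrent figure moves only when Snapshot Max Threads changes.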

Benefits.

A single large table can use more than one snapshot thread instead of being limited to one.
Snapshot workers stay busy when the captured tables have very different sizes.
The initial snapshot can finish faster when one or a few large tables dominate the workload.
The connector reaches the continuous CDC streaming phase sooner.
Existing table-level parallelism is retained — multiple tables and multiple chunks share the same worker pool.
The feature is opt-in and preserves existing snapshot behavior when disabled.
Tables that cannot be chunked fall back to a single chunk automatically instead of failing.

How it works

Etlworks counts the rows in each captured table.
For an eligible table, Etlworks orders rows by custom message key columns configured with message.key.columns, when a matching configuration exists; otherwise, it uses the table's database primary key.
It calculates key boundaries and divides the table into non-overlapping ranges (chunks).
The chunks are processed by the shared snapshot worker pool, alongside chunks and tables from the rest of the snapshot.
After the snapshot finishes, the connector transitions to normal CDC streaming.

Note: Do not confuse message.key.columns with the CDC Key field. CDC Key controls file or topic naming and is not used for snapshot chunking.

Both single-column and composite keys are supported, whether they come from the database primary key or from message.key.columns. A chunked snapshot reads the source table only — it does not partition, modify, or write anything to the source, and it does not change how DML is captured once streaming begins.
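
The steps above can be pictured with a small, self-contained Python sketch. It is a conceptual model under stated assumptions, not the Etlworks or Debezium implementation: the table is an in-memory list, the key is a single integer column named id, and the helpers key_ranges and read_chunk are hypothetical names. A real snapshot would compute boundaries with queries against the source and read each range with a keyed SELECT.

```python
from concurrent.futures import ThreadPoolExecutor


def key_ranges(sorted_keys, chunk_count):
    """Split sorted unique keys into non-overlapping inclusive (low, high) ranges."""
    if not sorted_keys:
        return []
    chunk_count = max(1, min(chunk_count, len(sorted_keys)))
    size, extra = divmod(len(sorted_keys), chunk_count)
    ranges, start = [], 0
    for i in range(chunk_count):
        end = start + size + (1 if i < extra else 0)
        ranges.append((sorted_keys[start], sorted_keys[end - 1]))
        start = end
    return ranges


def read_chunk(table, low, high):
    # Stands in for a keyed range read such as: WHERE id BETWEEN low AND high
    return [row for row in table if low <= row["id"] <= high]


# Hypothetical 10-row table keyed by "id".
table = [{"id": k, "value": f"row-{k}"} for k in range(1, 11)]
keys = sorted(row["id"] for row in table)
ranges = key_ranges(keys, chunk_count=4)

# A shared pool of 4 workers processes the chunks concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(lambda r: read_chunk(table, *r), ranges))

rows = [row for chunk in chunks for row in chunk]
print(ranges)     # [(1, 3), (4, 6), (7, 8), (9, 10)]
print(len(rows))  # 10
```

Because the ranges do not overlap and together cover every key, each row is read exactly once regardless of which worker picks up which chunk.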

When to use it

Consider chunked parallel snapshots when:

The initial snapshot contains one very large table.
There are only a few tables, so table-level parallelism alone cannot use all configured snapshot threads.
Table sizes are highly uneven and one table determines the total snapshot duration.
The source database and Integration Agent have capacity for multiple concurrent reads and connections.
The large tables have a database primary key or a valid message.key.columns mapping.
Reducing initial snapshot duration matters more than minimizing temporary source-database load.

When not to use it

Leave it disabled, or test carefully first, when:

The snapshot consists mostly of small tables.
There are already enough similarly sized tables to keep all snapshot threads busy.
The source database has strict connection limits.
The source database is CPU-, I/O-, or memory-constrained.
Additional concurrent full-table reads could affect production workloads.
A table has neither a database primary key nor a valid message.key.columns mapping.
A table uses a custom snapshot SELECT override or a partial snapshot filter.
You need resumable progress within an individual table (chunked snapshots are not resumable at the chunk level — see Resumability).

Preparation overhead. Chunking is not free. Before reading data, Etlworks collects an exact row count for each table and runs ordered key-boundary queries to calculate the ranges, and more chunks mean more queries and more coordination. On small tables this overhead can outweigh any performance benefit.

Choosing a parallelism strategy

Situation	Recommended approach
Many similarly sized tables	Table-level parallelism (Snapshot Max Threads) is usually sufficient; chunking adds little.
One or a few very large keyed tables	Enable chunked parallelism.
Mixed small and very large tables	Use table-level plus chunked parallelism; raise Snapshot Thread Multiplier above 1 only if snapshot threads become idle.
Partial or filtered table (SELECT override)	That table is processed as a single chunk; other tables can still be chunked.
Incremental snapshot	Tune Incremental Snapshot Chunk Size, not the parallel snapshot settings.

When chunking is not used even when enabled

The checkbox is a request to use chunking where possible, not a guarantee that every table is split. Even with Chunk large tables across snapshot threads enabled, a table is processed as a single chunk when:

The table has neither a database primary key nor matching custom message key columns configured with message.key.columns.
The table uses snapshot.select.statement.overrides (a partial snapshot).
A blocking ad-hoc request for the table includes an additional condition or filter.
The calculated workload requires only one chunk — including empty or very small tables.

If Snapshot Max Threads is 1, chunks may still be calculated, but they are processed sequentially and provide no parallel speedup.

Chunking also does not apply at all when:

The source connector is not supported (DB2 LUW or MongoDB) — it ignores the chunked path and keeps its existing snapshot behavior.
The snapshot mode does not read table data, such as a schema-only snapshot.
The Flow is in the continuous CDC streaming phase rather than the snapshot phase.
An incremental snapshot is running — incremental snapshots use their own separate chunking mechanism, sized by Incremental Snapshot Chunk Size.
A table has already completed and is skipped by Resume interrupted full snapshot.

A table that cannot be chunked can still run concurrently with other tables through the existing table-level parallel snapshot mechanism.
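
The rules in this section can be summarized as a decision function. The sketch below is a reading aid that encodes the conditions listed above; it is not how Etlworks evaluates them internally, and the names TableInfo, chunking_decision, and the returned strings are all hypothetical.

```python
from dataclasses import dataclass

# Connectors that ignore the chunked path, per the list above.
UNSUPPORTED_CONNECTORS = {"db2 luw", "mongodb"}


@dataclass
class TableInfo:
    has_primary_key: bool = True
    has_message_key_columns: bool = False  # matching message.key.columns entry
    has_select_override: bool = False      # snapshot.select.statement.overrides
    adhoc_filter: bool = False             # blocking ad-hoc request with a condition
    planned_chunks: int = 4                # result of the workload calculation


def chunking_decision(connector: str, chunking_enabled: bool, table: TableInfo) -> str:
    if connector.lower() in UNSUPPORTED_CONNECTORS:
        return "not applicable: connector unsupported"
    if not chunking_enabled:
        return "single chunk: feature disabled"
    if not (table.has_primary_key or table.has_message_key_columns):
        return "single chunk: no key"
    if table.has_select_override:
        return "single chunk: select override"
    if table.adhoc_filter:
        return "single chunk: ad-hoc filter"
    if table.planned_chunks <= 1:
        return "single chunk: workload too small"
    return "chunked"


print(chunking_decision("MySQL", True, TableInfo()))
print(chunking_decision("MySQL", True, TableInfo(has_select_override=True)))
print(chunking_decision("MongoDB", True, TableInfo()))
```

In every "single chunk" outcome the table can still run alongside other tables through table-level parallelism.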

Resumability

Resumability remains table-level. Chunked parallel snapshots are not resumable at the chunk level. With Resume interrupted full snapshot enabled:

Tables that completed before the interruption are skipped on restart.
If an interruption occurs before a large table completes, that table may be read again on retry.
Individual completed chunks are not persisted as independent resume points.
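
A minimal Python sketch of what table-level resumability means in practice. The progress structure and the function tables_to_read_on_restart are hypothetical and exist only to illustrate the three rules above; they do not reflect how Etlworks stores snapshot state.

```python
def tables_to_read_on_restart(progress: dict) -> list:
    """Return the tables that are read (again) after an interruption.

    Only the per-table 'completed' flag matters; per-chunk progress is ignored
    because completed chunks are not persisted as resume points.
    """
    return [name for name, state in progress.items() if not state["completed"]]


# Hypothetical state at the moment of interruption.
progress = {
    "customers": {"chunks_done": 4, "chunks_total": 4, "completed": True},
    "orders": {"chunks_done": 3, "chunks_total": 4, "completed": False},
    "order_items": {"chunks_done": 0, "chunks_total": 4, "completed": False},
}

print(tables_to_read_on_restart(progress))  # ['orders', 'order_items']
```

Note that orders is read again from the start even though three of its four chunks had finished.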

Not the same as an incremental snapshot. Both read a table in ranges, but they solve different problems: an incremental snapshot is a separate mechanism that is streaming-friendly and resumable at the chunk level, while a chunked parallel snapshot shortens a blocking snapshot (full or blocking ad-hoc), during which change streaming does not run. Choose incremental when uninterrupted streaming and chunk-level resumability matter; choose chunked parallel when you want a full snapshot to finish faster.

Operational guidance

Begin with Snapshot Max Threads between 2 and 4 unless you have already established source capacity.
Start with Snapshot Thread Multiplier = 1.
Change one setting at a time.
Monitor source database CPU, I/O, active sessions, and query duration, along with Integration Agent utilization.
Benchmark with representative tables before enabling it against a busy production source.
Do not treat “CPU cores × 2” as a universal recommendation — source database capacity and connection limits matter just as much.

Example configuration.

Snapshot Max Threads: 4
Chunk large tables across snapshot threads: enabled
Snapshot Thread Multiplier: 1

This lets up to four snapshot chunks execute concurrently. A table that is not eligible for chunking remains a single unit of work and still runs alongside other tables through table-level parallelism.

Verification and troubleshooting

The Integration Agent logs report how the snapshot was chunked, so you can confirm the feature is doing what you expect. The logs record:

How many chunks were created for each table.
The size of the chunked snapshot worker pool.
When a table falls back to a single chunk because it has no key.
When a table falls back because it uses a snapshot SELECT override.

Look for messages along the lines of “Table will be processed in N snapshot chunks”, “Creating chunked snapshot worker pool with N worker threads”, “Table has no key columns. Using one snapshot chunk”, and “Table uses a snapshot select override. Using one snapshot chunk”. Exact wording and punctuation can vary between releases — match on the concept, not on an exact string.
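
One way to match on the concept rather than an exact string is a set of loose, case-insensitive patterns. The Python sketch below assumes message shapes similar to those quoted above; the sample log lines, the table names in them, and the pattern names are hypothetical, so adjust the expressions to the wording your release actually emits.

```python
import re

# Loose patterns keyed by the concept they detect (names are illustrative).
PATTERNS = {
    "chunk_count": re.compile(r"processed in (\d+) snapshot chunks?", re.I),
    "pool_size": re.compile(r"worker pool with (\d+) worker threads?", re.I),
    "no_key_fallback": re.compile(r"no key columns", re.I),
    "override_fallback": re.compile(r"select override", re.I),
}


def classify(line: str):
    """Return (concept, number or None) for a log line, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return name, (int(match.group(1)) if match.groups() else None)
    return None


# Hypothetical log excerpt modeled on the messages quoted above.
sample_log = [
    "INFO Table inventory.orders will be processed in 4 snapshot chunks",
    "INFO Creating chunked snapshot worker pool with 4 worker threads",
    "WARN Table inventory.audit_log has no key columns. Using one snapshot chunk",
    "WARN Table inventory.products uses a snapshot select override. Using one snapshot chunk",
    "INFO Snapshot completed",
]

for line in sample_log:
    result = classify(line)
    if result:
        print(result)
```

If no chunk_count line appears for a table you expected to be split, check the fallback messages first, then confirm that Snapshot Max Threads is above 1.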

Partial Snapshot

What it does. Replaces the default SELECT * FROM table that the connector would otherwise run during the snapshot with a custom SELECT that includes a WHERE clause. Only rows matching the WHERE clause are copied to the destination as part of the snapshot; inserts, updates, and deletes that occur afterward are streamed in full via the normal change feed, regardless of the WHERE clause.

Where to configure. CDC connection, Other Parameters. Two related keys:

snapshot.select.statement.overrides — comma-separated list of fully qualified tables that have a custom snapshot SELECT.
snapshot.select.statement.overrides.<table> — the actual SELECT statement to use for each such table.

Applies to. Full (Initial) and Ad-hoc Snapshots. Does not apply to Incremental Snapshot.

Not divided into parallel chunks. A table configured through snapshot.select.statement.overrides is never split into parallel chunks. Because Etlworks does not rewrite an arbitrary custom SELECT statement into key ranges, that table uses the existing single-chunk snapshot path even when chunked parallel snapshots are enabled. You can leave chunking enabled connection-wide — only the override tables fall back to a single chunk; other tables without a SELECT override are still eligible for chunking.

When to use.

The source table is large enough that a full snapshot is impractical, and only a recent subset (for example, the last 90 days) is needed at the destination.
The pipeline is intentionally scoped to only include qualifying rows — for example, only orders in a specific status, or only records for a specific business unit.
Testing or piloting a CDC pipeline against a small slice of production data before scaling to the full table.

Tradeoffs.

Rows excluded from the snapshot are not present in the destination. This is the single most important thing to understand. If your WHERE clause excludes rows that later receive updates, the destination will see the update event without the original row — which typically resolves to either an inserted row from the update (for MERGE-based loads) or a lost update (for INSERT-only loads). The behavior depends on the destination flow configuration; test it before running against production.
Deletes on excluded rows are still emitted. A delete event for a row that never existed in the destination is normally a no-op, but on strict destinations it can produce errors. Confirm the destination flow tolerates deletes for missing rows if this is a concern.
The WHERE clause only affects the snapshot phase. Once streaming starts, the connector captures every change on the table regardless of the WHERE clause. If your intent is to filter out rows at both snapshot and stream time, use a source-side filter (a view or an additional transformation), not the partial snapshot mechanism.
No easy way to "top up" excluded rows later. If you decide you need the older rows after all, the simplest recovery is to trigger an ad-hoc snapshot of the whole table (which will re-snapshot with the default SELECT unless the override is still active) or to reset the flow and re-run with the WHERE clause loosened or removed.

Setting up a partial snapshot

Set the two related keys in Other Parameters on the CDC connection. Example: snapshot only the last 90 days of the products and orders tables:

snapshot.select.statement.overrides=test.dbo.products,test.dbo.orders
snapshot.select.statement.overrides.test.dbo.products=SELECT * FROM test.dbo.products WHERE created_at > DATE_SUB(NOW(), INTERVAL 90 DAY)
snapshot.select.statement.overrides.test.dbo.orders=SELECT * FROM test.dbo.orders WHERE created_at > DATE_SUB(NOW(), INTERVAL 90 DAY)

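The WHERE clauses must use the source database's SQL dialect, and the table names must use the fully qualified format of the source. The example above is illustrative and mixes conventions: DATE_SUB(NOW(), INTERVAL 90 DAY) is MySQL syntax, while three-part database.schema.table names such as test.dbo.products are typical of SQL Server, where the equivalent condition would be created_at > DATEADD(day, -90, GETDATE()). Adapt both the names and the syntax for MySQL, PostgreSQL, SQL Server, Oracle, or DB2 as needed.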
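
When many tables need the same kind of filter, the two keys can be generated rather than typed by hand. The following Python helper is a convenience sketch, not part of Etlworks: the function name partial_snapshot_properties is hypothetical, and it assumes every override is a plain SELECT * with a WHERE condition written in the dialect of your source.

```python
def partial_snapshot_properties(filters: dict) -> str:
    """Build Other Parameters lines from {fully qualified table: WHERE condition}."""
    # First key: comma-separated list of tables that have a custom SELECT.
    lines = ["snapshot.select.statement.overrides=" + ",".join(filters)]
    # One additional key per table with the actual SELECT statement.
    for table, condition in filters.items():
        lines.append(
            f"snapshot.select.statement.overrides.{table}="
            f"SELECT * FROM {table} WHERE {condition}"
        )
    return "\n".join(lines)


# Same tables and condition as the example above; adapt to your source dialect.
last_90_days = "created_at > DATE_SUB(NOW(), INTERVAL 90 DAY)"
print(partial_snapshot_properties({
    "test.dbo.products": last_90_days,
    "test.dbo.orders": last_90_days,
}))
```

Paste the printed lines into Other Parameters on the CDC connection, and review each generated SELECT against the source before running the snapshot.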
