Overview
High Watermark (HWM) replication is the most common way to implement incremental data loading in Etlworks.
It allows a Flow to load only records that have changed since the last successful run, instead of reloading the entire dataset every time.
HWM replication is unidirectional:
-
Data always flows from source to destination
-
Reverse replication requires a separate Flow with source and destination swapped
This technique works with:
-
Databases
-
Files
-
APIs
-
Any supported source and destination combination
The core idea is simple:
Etlworks remembers the highest value of a specific field from the last successful run and uses it to filter out older records in the next run.
That field is called the High Watermark Field.
If you are looking for other change replication methods available in Etlworks, see Change Replication Techniques.
What Problem HWM Replication Solves
Without HWM replication:
-
Every run reloads all data
-
Large datasets become slow and expensive
-
Reprocessing unchanged records increases risk and cost
With HWM replication:
-
Only new or updated records are extracted
-
Runs are faster and predictable
-
The process is safe to run repeatedly
Key Concepts You Must Understand
Before configuring HWM replication, it is important to understand these rules. Most user confusion comes from missing one of them.
1. HWM Is Tracked Per Transformation
Each source to destination transformation has its own HWM value.
-
HWM is not global to the Flow
-
HWM is not shared between transformations
-
HWM is stored using the transformation name as a key
This is why transformation naming is critical and discussed later.
2. A Valid High Watermark Field Is Required
HWM replication is possible only if the source contains a suitable High Watermark Field.
The field must:
-
Exist in the extracted data
-
Be of type DATE, TIMESTAMP, or NUMERIC (or convertible to one)
-
Represent record creation or modification order
If no suitable field exists in the source, HWM replication cannot be used.
3. HWM Is Updated Only After a Successful Run
Etlworks updates the stored HWM value only when the transformation finishes successfully.
If a run fails:
-
The HWM value is not updated
-
The next run will retry the same records
This guarantees no data loss.
The High Watermark Field
What Is a High Watermark Field
A High Watermark Field is the field Etlworks uses to determine whether a record is new or updated.
It represents a value that:
-
Always increases over time
-
Reflects record creation or modification
-
Can be compared across runs
Supported Data Types
The High Watermark Field must be one of the following:
-
TIMESTAMP
-
DATE
-
NUMERIC
If the value is not stored in one of these formats, it must be converted.
Recommended Characteristics
For reliable replication, the field should:
-
Be updated on every insert or update
-
Have an index if the source is a database
-
Never decrease in value
Common examples:
-
last_updated
-
modified_at
-
updated_on
-
sequence_id
-
audit_id
Using Fully Qualified Field Names
When the source query includes joins, subqueries, or aliases, field name collisions may occur.
In such cases, use a fully qualified field name for HWM field:
schema.table.column
Example:
public.audit_trail.last_modified
This avoids ambiguity and ensures correct filtering.
Using Multiple Fields as a High Watermark
Only one field can be configured as the High Watermark Field.
If multiple fields are required, combine them in the Source Query.
Example:
SELECT *,
COALESCE(updated_at, created_at) AS hwm_field
FROM orders
Then configure:
High Watermark Field = hwm_field
How HWM Replication Works Internally
This section explains exactly what Etlworks does.
First Successful Run
-
Etlworks extracts records from the source
-
It loads them into the destination
-
It calculates the maximum value of the High Watermark Field
-
That value is stored in metadata storage
At this point, the Flow has a baseline HWM.
Subsequent Runs
-
Etlworks retrieves the previously stored HWM value
-
It filters out records that are not newer than that value
-
Only new or updated records are processed
-
A new HWM value is calculated and stored
How Filtering Is Applied
Database source
Etlworks adds (or modifies) the WHERE clause dynamically:
SELECT *
FROM audit_trail
WHERE last_modified > ?
File or API source
Etlworks applies a JavaScript filter that skips older records.
Configuring HWM Change Replication
Step 1. Create the Flow
Create a Flow and add a source to destination transformation.
Step 2. Enable Change Replication
-
Open the transformation
-
Click CONFIGURE
-
Open the Change Replication tab
-
Enter the High Watermark Field
Step 3. Optional but Recommended: Calculate HWM on the Fly
This step is critical in environments where:
-
Destination tables already contain data
-
Destination data may be modified outside the Flow
-
You want protection against manual changes
Step 4. Handle First-Run Edge Case
If you calculate HWM using a destination query and the destination table does not exist yet:
-
Enable ignoring HWM query exceptions
-
Otherwise the first run will fail
Step 5. Configure Destination Action and Lookup Fields (Relational Databases Only)
If the destination is a relational database, configure how changed records are applied to the destination table.
Go to Transformation->Configure > Parameters, select the appropriate Action (for example, MERGE), and define the required Lookup Field(s).
For details read:
- Applying Changes to a Database Destination (Insert and Update)
- Lookup Fields (How Etlworks Identifies Existing Records)
Applying Changes to a Database Destination (Insert and Update)
When the destination is a database, HWM replication controls what records are extracted from the source. It does not automatically decide how to apply those records to an existing destination table.
You must configure the destination write behavior using the Action parameter.
What HWM Replication Supports
HWM replication supports:
-
Inserts (new records)
-
Updates (existing records)
HWM replication does not support:
-
Deletes
If you must replicate deletes, use a change-log based technique:
Choosing the Correct SQL Action
Default Behavior (Insert Only)
By default, Etlworks inserts records into the destination table.
This works only when one of the following is true:
-
The destination table is empty on the first run and never receives duplicates later, or
-
The destination table can safely accept duplicates, or
-
You are loading into a staging table and merging later outside Etlworks
If the destination already contains data and you want to keep it in sync, you typically need an upsert strategy.
Where to Configure the Action
Go to:
Configure > Parameters > Action
Available actions include:
-
INSERT (default)
-
UPDATE
-
MERGE (UPSERT)
-
BULK MERGE
-
BULK DELETE/INSERT
-
IfExist
-
CUSTOM
-
Conditional
Most of these require configuring lookup fields.
Recommended Actions for HWM Replication
Option 1 (Preferred): MERGE (Upsert)
MERGE updates existing rows and inserts missing rows.
MERGE is natively supported for these databases:
-
Oracle
-
Microsoft SQL Server
-
PostgreSQL
-
MySQL
-
DB2
-
Informix
-
Sybase
When to use MERGE:
-
You want one action that handles inserts and updates
-
The destination is expected to already contain some data
-
You want idempotent behavior across reruns
What MERGE does not do:
-
It does not delete rows that disappeared from the source
-
It does not detect conflicts where destination was modified independently. Source wins.
Option 2: BULK MERGE and BULK DELETE/INSERT
These are bulk strategies used primarily for performance at scale.
General behavior:
-
BULK MERGE: a bulk upsert strategy
-
BULK DELETE/INSERT: delete matching rows first, then insert the batch
Important notes for HWM replication:
-
These options still do not replicate deletes from the source.
-
BULK DELETE/INSERT can cause temporary gaps, extra logging, or lock amplification depending on the database and transaction settings.
-
Use them when performance matters more than per-row granularity.
Option 3: IfExist (Insert or Update)
IfExist is an alternative to MERGE.
Behavior:
-
Etlworks checks if the record exists using lookup fields
-
If it exists, Etlworks executes UPDATE
-
Otherwise, it executes INSERT
When to use IfExist:
-
MERGE is not supported by the destination database or driver
-
You want explicit insert/update logic
-
You want simpler SQL generated by the connector
Tradeoff:
-
It is much slower than MERGE for large batches because it requires existence checks
Option 4: CUSTOM (Custom Upsert Template)
Some connectors support database-specific upsert clauses, but syntax differs by driver and database.
With CUSTOM, you provide an upsert template, and Etlworks generates the runtime SQL.
When to use CUSTOM:
-
Destination supports upsert but not via standard MERGE
-
You need vendor-specific syntax (for example, special UPSERT or INSERT ON CONFLICT behavior)
-
You want full control over the statement
Option 5: UPDATE Only
UPDATE updates existing rows only.
When to use UPDATE:
-
You are certain all rows already exist in the destination
-
Inserts are handled separately
-
You are applying corrections only
If rows do not exist, they will not be created.
Conditional Action
Conditional action allows executing an action only when a JavaScript condition is true.
Use this when:
-
You want to route certain rows to different behavior
-
You need conditional update logic
-
You want to skip writes for certain records
This still requires lookup fields if any update-like action is used.
What This Means for HWM Replication
HWM controls extraction like this:
-
“Get records newer than the last stored HWM value”
Action controls how those extracted records are applied:
-
“Insert them”
-
“Update them”
-
“Insert or update them”
For typical incremental synchronization into a database table:
-
Use MERGE (preferred) or IfExist
-
Configure Lookup Field(s) to match the destination unique key
-
Do not expect delete replication. Use CDC or CT if deletes matter.
Lookup Fields (How Etlworks Identifies Existing Records)
Lookup fields define how Etlworks finds an existing row in the destination table when performing updates or upserts.
They are required for all actions that modify existing data, including:
-
UPDATE
-
MERGE (UPSERT)
-
BULK MERGE
-
BULK DELETE/INSERT
-
IfExist
-
CUSTOM
-
Conditional actions that perform updates
Without lookup fields, Etlworks has no reliable way to determine whether a record already exists.
What Lookup Fields Represent
Lookup fields must uniquely identify a single row in the destination table.
In most cases, lookup fields correspond to:
-
A primary key, or
-
A unique constraint, or
-
A natural business key (for example, order_id + source_system)
If lookup fields do not uniquely identify a row, Etlworks may:
-
Update multiple rows
-
Update the wrong row
-
Appear to behave inconsistently across runs
Rule of thumb:
Lookup fields should match a uniqueness guarantee in the destination database.
Simple Lookup Field Configuration
For a single destination table, lookup fields are usually specified as a comma-separated list of column names:
order_id
Or composite keys:
order_id, source_system
These column names must exist in the destination table and be present in the data being loaded.
Predict Lookup Fields (Automatic Detection)
Etlworks provides an option to Predict Lookup Fields.
When enabled, Etlworks attempts to automatically determine which columns uniquely identify a row using heuristics (first) and AI (if heuristics fail).
Important notes:
-
This is a convenience feature, not a guarantee
-
Automatic detection may be incorrect
-
It should be reviewed carefully before running production Flows
Note: Predict Lookup Fields and manually configured lookup fields can be enabled at the same time. Etlworks always applies manually defined lookup fields first and uses the predictive algorithm only as a fallback when no manual rules are provided.
Advanced Lookup Fields Configuration (Multiple Tables or Wildcards)
When working with:
-
Wildcard sources
-
Multiple destination tables
-
Dynamically generated transformations
You can explicitly map lookup fields per table.
Format:
table1=field1,field2;
table2=field1,field2;
Example:
public.inventory=inventory_id,database_name;
public.payment=payment_id,database_name;
public.rental=rental_id,database_name;
Rules:
-
Use fully qualified table names
-
Separate table mappings with semicolons
-
Fields listed after = form the composite lookup key for that table
This approach is strongly recommended for wildcard-based HWM replication when correct lookup fields cannot be reliable predicted.
Lookup Fields and HWM Replication
HWM replication controls which records are extracted.
Lookup fields control how those records are applied to the destination.
Both must be configured correctly for reliable incremental synchronization.
Common misconfiguration patterns:
-
HWM field is correct, but lookup fields are missing → duplicates
-
Lookup fields exist, but are not unique → incorrect updates
-
Lookup fields reference columns not present in the destination → runtime failures
Key Takeaway
Lookup fields are not optional metadata.
They are the mechanism that turns “changed records” into correct inserts or updates.
If HWM defines when a record should be processed, lookup fields define where it belongs in the destination table.
Getting them right is essential for correct HWM replication behavior.
Calculating High Watermark Value on the Fly Using SQL
Why This Matters
Automatic HWM tracking assumes:
-
Only this Flow modifies the destination table
If that assumption is false, HWM may drift or skip records.
Calculating HWM from the destination table fixes this.
IMPORTANT: this only works when destination is a relational database or cloud data warehouse, such as Snowflake.
How It Works
Provide an SQL query in High Watermark Field Value under Transformation > Configure > Change Replication that returns the current maximum HWM value from the destination.
Example:
SELECT MAX(last_updated) FROM orders
This query:
-
Runs against the destination connection
-
Overrides the internally tracked HWM value
Using the {TABLE} Token
When working with wildcard sources, use {TABLE} token:
SELECT MAX(id) FROM {TABLE}
Etlworks replaces {TABLE} with the actual destination table name.
Ignoring Exceptions When Executing HWM Query
The Edge Case
If:
-
The destination table is created automatically
-
The HWM query runs before the table exists
Then the Flow fails.
Solution
Set High Watermark Exception to Ignore.
Example value:
object does not exist
If the exception message contains this text:
-
HWM value is set to NULL
-
The Flow continues normally
Calculating HWM Value Using JavaScript
When to Use JavaScript
Use JavaScript when:
-
HWM value needs transformation
-
You need custom logic
-
Numeric or timestamp normalization is required
Provide a JavaScript expression in Change High Watermark Field Value under Transformation > Configure > Change Replication that returns the current maximum HWM value from the destination.
Example
highWatermark == null ? 0 : java.lang.Math.round(highWatermark)
Rules
-
The last evaluated expression is returned
-
NULL handling is your responsibility
Available Variables
You can reference:
-
dataSet
-
etlConfig
-
scenario
-
source
-
destination
These allow advanced logic when required.
Integrating High Watermark Logic into Custom Source Queries
Why This Section Exists
The most common and simplest HWM scenario is:
-
Source table is replicated as-is
-
Etlworks automatically adds the HWM filter
-
No custom SQL is involved
This is the happy path.
However, many real-world Flows do not extract data as-is.
Instead, users often define a custom source query to:
-
Join multiple tables
-
Filter records
-
Transform fields
-
Derive new columns
-
Normalize or denormalize data
Once you provide a custom source query, Etlworks can no longer safely modify it automatically.
At that point, you are responsible for explicitly integrating HWM logic into the query.
This section explains how to do that correctly.
The Core Rule
If a source query is explicitly defined, Etlworks does not automatically inject HWM filtering.
You must reference the previous High Watermark value manually inside the query.
Failing to do so means:
-
All records are extracted on every run
-
HWM is effectively ignored
-
Incremental replication does not occur
Where the Previous HWM Value Comes From
After each successful run, Etlworks stores the last calculated HWM value in metadata storage.
That value can be referenced using a token with the following format:
{TRANSFORMATION_NAME_HIGH_WATERMARK}
Example:
{ORDERS_HIGH_WATERMARK}
The token always resolves to:
-
The last stored HWM value, or
-
NULL if the Flow has never completed successfully
Bind Variables vs String Substitution
When a custom source query references the HWM token, Etlworks supports two ways of passing the value into the query:
-
String substitution (default, backward-compatible behavior)
-
Bind variables (optional, explicit behavior)
This behavior is controlled by the checkbox:
Use Bind Variables for High Watermark Values
When the option is disabled (default)
-
The HWM token is replaced directly in the SQL text
-
This preserves backward compatibility
-
This mode is required when the HWM value represents:
-
A list of IDs
-
A dynamically generated SQL fragment
-
Any non-scalar expression
-
Example use case:
WHERE order_id IN ({ORDERS_HIGH_WATERMARK})
at runtime replaced with:
WHERE order_id IN (1,2,3)
When the option is enabled
-
The HWM value is passed as a bind variable
-
This is safer and usually faster
-
This mode requires the HWM value to be a single scalar value (DATE, TIMESTAMP, or NUMERIC)
Example use case:
WHERE updated_at > {ORDER_HIGH_WATERMARK}
at runtime replaced with:
WHERE updated_at > ?
and bind variable value is set to the value of HWM.
Use this option when your HWM represents a single value and you want better performance and predictable SQL execution.
Handling the First Run Correctly (NULL HWM)
On the first run, there is no stored HWM value.
If your query does not account for this, it will often return zero rows or fail.
To handle this safely, always use the following pattern:
{TOKEN} IS NULL OR hwm_field > {TOKEN}
Example:
{ORDERS_HIGH_WATERMARK} IS NULL OR hwm_field > {ORDERS_HIGH_WATERMARK}
This guarantees that:
-
First run loads all data
-
Subsequent runs load only changes
IMPORTANT: pattern above works when Use Bind Variables for High Watermark Values is enabled.
Example: Adding HWM to an Existing Source Query
Original Source Query (No HWM)
SELECT o.id,
o.customer_id,
o.updated_at,
c.country
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'ACTIVE'
This query works, but it always reloads all matching records.
Updated Source Query with HWM Logic
Assumptions:
-
Transformation Name: ORDERS
-
High Watermark Field: o.updated_at
- Use Bind Variables for High Watermark Values is enabled
SELECT o.id,
o.customer_id,
o.updated_at,
c.country
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'ACTIVE'
AND (
{ORDERS_HIGH_WATERMARK} IS NULL
OR o.updated_at > {ORDERS_HIGH_WATERMARK})
Now the query behaves correctly:
-
First run loads all active orders
-
Subsequent runs load only newly updated orders
Using HWM in Queries with Derived Fields
If the HWM field is derived inside the query, reference the derived value consistently.
Example:
SELECT *,
COALESCE(updated_at, created_at) AS hwm_field
FROM orders
WHERE (
{ORDERS_HIGH_WATERMARK} IS NULL
OR COALESCE(updated_at, created_at) > {ORDERS_HIGH_WATERMARK}
)
The High Watermark Field must be set to:
hwm_field
Using the Previous HWM Value in HWM Calculation Queries
The same token can be used inside the HWM calculation query itself.
This is useful when you want to calculate the new HWM only from newly processed data, not from the entire destination table.
Example:
SELECT MAX(updated_at)
FROM orders_archive
WHERE (
{ORDERS_HIGH_WATERMARK} IS NULL
OR updated_at > {ORDERS_HIGH_WATERMARK}
)
This ensures the HWM advances only when new data is actually processed.
Common Mistakes to Avoid
-
Forgetting to add HWM logic to a custom source query
-
Using the HWM token without handling NULL
-
Referencing a field name that does not match the configured HWM field
-
Changing the transformation name after HWM values are already stored
Any of these will cause incorrect or confusing behavior.
Key Takeaway
Once you define a custom source query:
-
Etlworks stops modifying it automatically
-
HWM becomes explicit, not implicit
-
You must integrate HWM logic yourself
If done correctly, HWM replication works exactly the same as in the happy path scenario, just with full control over extraction logic.
If not, incremental replication will silently stop working.
Manually Setting the High Watermark Value
There are rare cases where you must control the initial or current HWM.
High Watermark Field Value on First Run
-
Used only once
-
Ignored on subsequent runs
-
Useful when destination already contains historical data
Timestamp format:
YYYY-MM-DD HH:mm:SS.mil
High Watermark Field Value
-
Overrides all automatic behavior
-
Can be a literal value or SQL query
-
Applies to every run
Use with caution.
HWM Replication with Wildcard Sources
Wildcard sources create multiple transformations dynamically.
Critical Rule
Transformation names must be stable.
Configuration
1. Use wildcard in source
2. Open Parameters tab
3. Set Transformation Name to:
{source}
This ensures:
-
Each source object gets its own HWM
-
Stored values can be located correctly
Source Queries in Wildcard Transformations
- Use {TABLE} in Source Queries
- Use {HWM_TOKEN} to reference Previous HWM value
Example:
select * from {table}
where {HWM_TOKEN} IS NULL or last_update > {HWM_TOKEN}
Other Notes
-
All source objects must share the same HWM field
- Use Bind Variables for High Watermark Values must be enabled
- Use {TABLE} in HWM SQL queries: select max(last_update) from {table}
Saving HWM Only on Full Success
The Problem
In nested flows:
-
Partial success may still update HWM
-
This can skip records unintentionally
Solution
Enable Save metrics only on success under Flow->Parameters
This ensures:
-
HWM is saved only if all steps succeed
-
Partial runs do not advance HWM
Incremental Replication for Non-Database Sources
HWM replication is not limited to databases.
Requirements:
-
High Watermark Field must be DATE, TIMESTAMP, or NUMERIC
-
Change replication must be enabled
Etlworks handles filtering automatically.
How HWM Replication with Non-Databases Sources Works
Flat Sources (CSV, Excel)
- Data is streamed from the source record by record
- The filter is applied for each record based on the value of the HWM field stored in Etlworks storage.
Nested Sources (JSON, XML)
- The entire dataset is extracted from the source into memory
- The filter is applied to entire dataset based on the value of the HWM field stored in Etlworks storage.
Conflict Resolution
If the same record is modified in both source and destination:
-
Source always wins
-
Destination changes are overwritten
HWM replication does not perform conflict detection.
Troubleshooting
HWM Was Reset Unexpectedly
HWM is reset if:
-
Source name changes
-
Destination name changes
-
Transformation order changes
-
Transformation is enabled or disabled
Reason
HWM is stored using the transformation name and index.
Solutions
Non-wildcard transformations:
-
Set a fixed transformation name
Wildcard transformations:
-
Use {source} or {destination}
Best solution:
-
Calculate HWM on the fly
Flow Fails on First Run with HWM Query
Reason:
-
Destination table does not exist yet
Solution:
-
Ignore HWM query exception
Recommendations
For reliable HWM replication:
-
Always set a stable transformation name
-
Prefer calculating HWM from destination state
-
Use Save metrics only on success for nested flows
-
Treat HWM as state, not magic