About HWM Change Replication
In Integrator, it is possible to create a flow that tracks changes in a data source, for example, database or API, and loads the changed records into the destination, for example, another database or API. Another name for change replication is data synchronization.
Integrator supports unidirectional change replication: from the source to the destination. To replicate changes in the opposite direction--from the destination to the source--a separate flow must be created, where the source and the destination swap places.
This technique uses a high watermark field to identify recently changed records.
Read about other change replication techniques available in Etlworks Integrator.
Change replication can occur:
Between two databases: a flow pulls changed records from the source database to update the destination database.
In incremental database backups: a flow pulls changed records from the source database and packages them into "update" files.
To pull recently changed records from an API and load them into any supported destination: a flow pulls changed records from the supported API or web service and loads them into any supported destination, such as a data warehouse.
Resolving change conflicts
When the same record is modified in both the source and destination, Integrator will simply apply the source changes to the destination.
A high watermark is the highest peak in value that a field has reached.
Which fields could become a high watermark field?
Basically, any field can be a high watermark field as long it has the data type TIMESTAMP, DATE or NUMERIC and can be used to uniquely identify recent changes.
For a database, it is recommended that there is an index for the high watermark field.
Is it possible to set two or more fields as a high watermark for a single transformation?
There can only be one high watermark field.
Suppose there is a table
audit_trail in a source PostgreSQL database. This table will be updated each time the user logs into the system or executes a specific function. We would like to track changes in this table and load them into an online data warehouse (which is also a database).
Suppose that the table
audit_trail has a field
Last_modified TIMESTAMP NOT NULL which is updated each time a record is inserted or updated.
We would use
Last_modified as the high watermark field, so we can track changes in the
How HWM change replication works
Etlworks Integrator always tracks what records have been extracted and what records have been loaded during a typical ETL.
It is possible to setup a High watermark field for that particular transformation.
When a high watermark field is set, Integrator calculates a maximum value for the field (high watermark) for the last successful load.
Integrator stores this value with the metadata.
The next time the ETL process is executed, Integrator uses the previously calculated high watermark value to filter out records which are older than the previous high watermark.
This technique works equally well for databases, files and APIs.
If the change replication is from a database, Integrator modifies the
WHEREclause on the fly, to select records which are newer than the high watermark:
select * from audit_trail where last_modified > ?-- where
?is a high watermark value.