When to use this transformation
Use this transformation to extract data from the source, transform it, and load it into the destination.
An ETL flow or an Any to Any ETL flow can include multiple source-to-destination transformations. Transformations are typically executed in order, but they can also run out of order if parallel execution is enabled.
Adding a new source-to-destination transformation
To add a new source-to-destination transformation, select the flow type that goes from the source to the target (for example, web service to database) and click the button to add a transformation.
You are required to select a connection for the source and destination, as well as select or enter a source name (FROM) and a destination name (TO). Format is only required for transformations with data objects, such as web services or flat and structured files.
Source (FROM) and Destination (TO)
A typical transformation includes a source (FROM) and a destination (TO). The source and destination can be any data source supported by Integrator: tables in a relational database, SQL statements, files, web services, API endpoints, or social media websites.
- If the source or destination is a relational database, the name will be the name of the table, view, or synonym. Wildcard object names are supported. When configuring a FROM with a wildcard, you must use a fully qualified object name, for example public.*.
- If the destination is a relational database and the FROM is configured with a wildcard, you can use * as a part of the destination name (TO). The system will substitute the * with the source name at run time (see the example after this list).
- If the source or destination is in file storage (server or cloud), the name is an actual file name or a wildcard file name.
- If the source or destination is a web service, the name is any appropriate name. For example, patient.json.
- If the source or destination is an email connection, the name is either the name of the attached file or any appropriate name.
- If the source or destination is a key-value storage, such as Redis, the name is the key name.
- If the source or destination is a NoSQL database, such as MongoDB, the name is the document ID.
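As an illustration of the wildcard substitution above (the schema and table names are hypothetical): if the FROM is configured as public.* and the TO is set to *, each object matched by the wildcard, such as public.employee, is loaded into a destination with the same name, because the * in the TO is replaced with the matching source name at run time.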
Programmatically set Destination Object Name (TO)
Read how to programmatically set a destination object name (TO) when the source is configured with a wildcard.
Selecting FROM and TO objects from the list
If the source and/or destination uses a live connection, the system automatically populates the list of available objects (tables, files, endpoints) from that connection. Use the metadata selector to pick an object from the list.
You can add new transformations, as well as reorder, disable or delete them. Disabled transformations will not be executed.
Most transformations include per-field mapping and additional parameters.
In Integrator, you can customize your source-to-destination transformation by entering parameters.
To open the Parameters section while still on the MAPPING screen, select one of the available tabs, such as On Exception.
The actual list of available parameters and high-level transformations is dynamic and depends on the type of flow as well as the source and destination connections.
The Transformation name
Each source-to-destination transformation has either an automatically generated or a user-assigned name. In most cases, the name is generated automatically using the following algorithm: FROM + " TO " + TO + INDEX. For example, if the FROM (source) is employee.json, the TO (destination) is employee.csv, and the index of the transformation is 1, the transformation name will be employee.json TO employee.csv 1.
If you override the transformation name under mapping->parameters, that name will be used instead.
If you are setting the transformation name manually, make sure it is unique within the flow. Otherwise, only the last transformation with this name will be executed.
Linked Source Name
By default, Integrator creates pairs of sources and destinations (From-To). Each source is linked to a single destination. It is possible to link a single source to multiple destinations by specifying this property. For example, let's assume that there is a source (From) employee.json. You can link it to multiple destinations (To) by creating multiple From-To pairs and setting the Linked Source Name to employee.json for each of these pairs. The From part of each pair will be ignored and replaced with the Linked Source Name.
Parameters used to improve performance
When this option is enabled (it is enabled by default), Integrator streams data from the source to the destination. When streaming is enabled, Integrator never reads the whole data set (table, view, etc.) into memory. Instead, it reads data from the source in chunks and immediately writes it to the destination. In general, streaming is much faster and saves a lot of memory, so it is enabled by default. However, streaming is not always possible, so Integrator makes every attempt to detect situations where streaming won't work and automatically disables it. In rare cases, when Integrator cannot automatically disable streaming, you might need to do this manually, hence the streaming option.
Use Bind Variables
The fastest way to execute SQL statements is to use bind variables. When multiple identical SQL statements are executed in batches and the only difference between them is the field values, a lot of run time is saved because the SQL statement is prepared (compiled) only once. Using bind variables also automatically enables streaming and batch processing; SQL statements are processed much faster in batches. Integrator automatically sets the batch window to 100, but it is also possible to set the batch window manually using the Batch Window field. Bind variables (enabled by default) are subject to the same restrictions as streaming, so the same rules apply.
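To make the mechanics concrete, here is a minimal JDBC sketch, not Integrator's actual code, of how streaming, bind variables, and a batch window of 100 fit together. The connection URLs, table, and column names are hypothetical.

```java
import java.sql.*;

// Minimal sketch of streaming + bind variables + batching at the JDBC level.
// Not Integrator's code; URLs, table, and column names are hypothetical.
public class BindVariableBatchSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://source-host/db", "user", "pass");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://dest-host/db", "user", "pass")) {

            // The INSERT is prepared (compiled) once; only the bind values change per row.
            String insert = "INSERT INTO employee (id, name) VALUES (?, ?)";
            try (Statement read = src.createStatement();
                 ResultSet rs = read.executeQuery("SELECT id, name FROM employee");
                 PreparedStatement write = dst.prepareStatement(insert)) {

                final int batchWindow = 100; // Integrator's default batch window
                int pending = 0;
                while (rs.next()) {              // rows are read in chunks, never the whole table at once
                    write.setLong(1, rs.getLong("id"));
                    write.setString(2, rs.getString("name"));
                    write.addBatch();
                    if (++pending == batchWindow) {
                        write.executeBatch();    // flush every 100 rows
                        pending = 0;
                    }
                }
                if (pending > 0) {
                    write.executeBatch();        // flush the remainder
                }
            }
        }
    }
}
```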
Our ETL engine is written in Java and uses the JDBC protocol to connect to databases. In JDBC, the setFetchSize(int) method is very important for performance and memory management within the JVM, as it controls the number of network calls from the JVM to the database as well as the amount of RAM used for ResultSet processing. Many JDBC drivers (for example, Oracle) enforce the fetch size by default, while some others (for example, MySQL and Redshift) do not. The Fetch Size field can be used to set the maximum number of records retrieved in one database call, thus limiting the memory consumption within the JVM.
Setting Fetch Size at the transformation level overrides Fetch Size set at the connection level.
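For illustration only, here is a hedged sketch of what the Fetch Size parameter maps to in JDBC; the table name is hypothetical and, as noted above, driver behavior varies.

```java
import java.sql.*;

// Illustration only: capping the number of rows the driver pulls per round trip.
// The table name is hypothetical; not all drivers honor the hint the same way.
public class FetchSizeSketch {
    static void readWithFetchSize(Connection source, int fetchSize) throws SQLException {
        try (Statement stmt = source.createStatement()) {
            stmt.setFetchSize(fetchSize); // e.g. 5000 rows per network call / buffered in the JVM at a time
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM large_table")) {
                while (rs.next()) {
                    // hand each row to the destination writer here
                }
            }
        }
    }
}
```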
By default, our ETL engine extracts all the records from the source, limited only by the source query or filter conditions. You can use this field to set the upper limit on the total number of records which can be extracted from the source.
Setting Max Records at the transformation level overrides Max Records set at the connection level.
When this option is enabled (the default), the particular transformation will be executed in a parallel thread. From a practical standpoint, this means that Integrator can run multiple transformations in parallel, which is generally much faster. However, when certain transformations depend on others (for example, when the destination table "items" depends on the destination table "order") you might want to disable parallel execution.
Wait for Completion
Enable this option to set a synchronization point that allows the flow to wait for the completion of the parallel transformations before proceeding to the next step. There can only be one synchronization point per destination connection.
In many cases, when the source is a file or a response from a web service and the destination is a database table, streaming (and the ability to use bind variables) will automatically be disabled by the engine, even when it is enabled for the specific transformation. This happens, for example, if the source is an XML or JSON document, or if SQL queries are used to filter out records in the response from the web service call.
In Integrator, it is possible to force streaming, even when it would normally not be allowed, by enabling the Force Streaming option. When streaming is forced, the system automatically creates a streamable view from a non-streamable source. It still keeps all the records in memory, but it does not keep all the SQL statements in memory, and it is possible to use bind variables with a streamable view.
Parameters available only when processing source objects matching the wildcard name
If the source (FROM) is configured with a wildcard, you can use this field to exclude specific database objects or files. Use a comma (,) to separate multiple objects. Wildcard object names are supported in the exclusion list.
Note that you must use the fully qualified database object name in the exclusion list.
If the source is public.*, the system will include all tables and views in the schema public. If you want to exclude tables and views which start with the prefix account, you can configure the following exclusion list: public.account*.
To exclude all views, enter all views in the Exclude objects field.
To exclude all tables, enter all tables in the Exclude objects field.
If the source (FROM) is configured with a wildcard, you can use this field to include only specific database objects or files. Use a comma (,) to separate multiple objects. Wildcard object names are supported in the inclusion list.
Note that you must use the fully qualified database object name in the inclusion list.
If the source is public.*, the system will include all tables and views in the schema public. If you only want to include tables and views which start with the prefix account, you can configure the following inclusion list: public.account*.
Use Parallel Threads when processing sources by a wildcard
If the transformation is configured with a wildcard source name, this option controls whether a separate parallel thread is used for each source that matches the wildcard name.
Load data into the same destination sequentially
If the transformation is configured with a wildcard source name and the use of parallel threads for wildcard sources is enabled, this option controls whether data is loaded into the same destination (for example, a database table) sequentially or in parallel. Loading data sequentially makes sense when you need to MERGE data in the same order in which it was updated in the source.
Enable this option if:
- You need to load files matching a wildcard into a database using the MERGE (UPSERT) action.
- You are loading files matching a wildcard into Snowflake or Redshift using the MERGE action.
- You are loading files matching a wildcard using the CDC MERGE action.
Parameters which control whether the transformation will be executed
Ignore when there are no records
If this parameter is enabled and the source data set has no records, the entire transformation will be ignored. For example, the file will not be loaded into the database, the API call will not be made, etc.
Parameters available only when the source is a file
Process all files
If this parameter is enabled and there are multiple files matching a wildcard file name, all matching files will be processed (extracted and loaded).
Ignore when there is no file
If this parameter is enabled and there are no source files, the entire transformation will be ignored. For example, the file will not be loaded into the database, the API call will not be made, etc.
Delete loaded source files
If this parameter is enabled, the system will delete the loaded source files.
If this parameter is enabled and an error occurs during the load, the system will delete the source file.
Parameters available only when the destination is a database
Create target table if it doesn't exist. Also update columns to match the target
If this parameter is enabled (the default) and the destination table does not exist, Integrator will automatically create it. Also, when this option is enabled, the insert, update, and merge statements will only include fields that exist in the destination table, so it is recommended that you keep it enabled.
If this parameter is enabled (disabled by default) and the source table has indexes, those same indexes will be created for the destination table.
Alter target table if the source has columns that the target table doesn't have
If this parameter is enabled (it is disabled by default) and the Create target table... parameter is also enabled, the system will automatically create and execute an ALTER TABLE SQL statement for each column that exists in the source but does not exist in the destination.
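For example (the names are hypothetical, and the exact syntax depends on the destination database), if the source gains a middle_name column that the destination table employee does not have, the generated statement would look roughly like ALTER TABLE employee ADD middle_name VARCHAR(255).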
Recreate target table if the source has columns that the target table doesn't have
If this parameter is enabled (it is disabled by default) and the Create target table... parameter is also enabled, the system will automatically drop and recreate the destination table (if it exists) when the source table has changed.
This parameter is not available if High Watermark Change Replication is enabled.
The SQL to be executed on the destination connection before the extract. Optionally, you can configure the flow to Ignore Errors that occurred when executing the Before SQL.
The SQL to be executed on the destination connection after the load. Optionally, you can configure the flow to Ignore Errors that occurred when executing the After SQL.
Use Native SQL
If this option is enabled (disabled by default), the native SQL used for the destination connection SQL dialect (for example, PL/SQL for Oracle) will be used. It is generally slower to use native SQL because it automatically disables the ability to use bind variables.
SQL action (available only if the destination is a database)
By default, Integrator INSERTS records into the destination table. You can change this to a different SQL action, such as MERGE (UPSERT) or IfExist.
MERGE is natively supported for Oracle, MS SQL Server, PostgreSQL, MySQL, DB2, Informix, and Sybase databases.
The alternative to MERGE is the IfExist action: Integrator will check whether the record already exists; if it does, it executes UPDATE, otherwise it executes INSERT. The action can also be conditional.
Read more about MERGE (UPSERT) and IfExists actions.
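As a rough sketch only (not Integrator's implementation; the table, columns, and the id lookup field are hypothetical), the IfExist action boils down to the following check-then-write pattern:

```java
import java.sql.*;

// Rough sketch of the IfExist pattern (check, then UPDATE or INSERT); not Integrator's code.
// Table, column, and lookup-field names are hypothetical.
public class IfExistSketch {
    static void upsert(Connection dst, long id, String name) throws SQLException {
        try (PreparedStatement check = dst.prepareStatement(
                "SELECT COUNT(*) FROM employee WHERE id = ?")) {   // id = lookup field
            check.setLong(1, id);
            try (ResultSet rs = check.executeQuery()) {
                rs.next();
                boolean exists = rs.getInt(1) > 0;
                String sql = exists
                        ? "UPDATE employee SET name = ? WHERE id = ?"
                        : "INSERT INTO employee (name, id) VALUES (?, ?)";
                try (PreparedStatement write = dst.prepareStatement(sql)) {
                    write.setString(1, name);
                    write.setLong(2, id);
                    write.executeUpdate();
                }
            }
        }
    }
}
```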
The Lookup Fields field allows you to enter a comma-separated list of fields in the destination table which will be used to uniquely identify a record for the IfExist action. Note that field names are case sensitive.
If the action is MERGE and the database is PostgreSQL or MySQL, there must be a unique index which includes all "lookup fields." For other databases, such an index is optional but improves performance.
Predict Lookup Fields
If this option is selected and the Lookup Fields text field is empty, the system will try to predict the column(s) which uniquely identify the record.
The system uses the following algorithm to predict the lookup fields:
- If the destination is a database, the specific database supports indexes (not all databases do), and the destination database object already exists, the system will try to get either the primary key or the first suitable unique index from the destination object and use the included column(s) as the Lookup Field(s).
- If everything above fails, the system will try to extract the Lookup Field(s) from the source, using exactly the same approach, except this time it will be using the source connection and the source database object name.
- If that does not work, the system will look for columns whose names contain part of the destination table name. For example, if the table name is patient, the system will look for columns such as patient_id or patient_key.
- Finally, if all the above fail, the system will simply use the first column in the destination database object as a Lookup Field.
It is important to understand that this approach does not guarantee that the correct column(s) will be selected as a value for the Lookup Field(s). But it can be the only solution when processing tables by a wildcard name, simply because there is just one source-to-destination transformation, which must adapt itself to the different pairs of source/destination objects.
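For readers who want to see the fallback chain spelled out, the following is a simplified JDBC-metadata sketch of the prediction order described above. It is an illustration under assumed behavior, not the engine's actual code; the connection, schema, and table names are supplied by the caller.

```java
import java.sql.*;
import java.util.*;

// Simplified illustration of the lookup-field prediction order; not the actual engine code.
public class LookupFieldGuess {
    static List<String> guessLookupFields(Connection con, String schema, String table) throws SQLException {
        DatabaseMetaData md = con.getMetaData();
        List<String> fields = new ArrayList<>();

        // 1. Prefer the primary key of the database object.
        try (ResultSet pk = md.getPrimaryKeys(null, schema, table)) {
            while (pk.next()) fields.add(pk.getString("COLUMN_NAME"));
        }
        if (!fields.isEmpty()) return fields;

        // 2. Fall back to the first suitable unique index.
        try (ResultSet idx = md.getIndexInfo(null, schema, table, true, false)) {
            while (idx.next()) {
                String col = idx.getString("COLUMN_NAME");
                if (col != null) { fields.add(col); break; }
            }
        }
        if (!fields.isEmpty()) return fields;

        // 3. Look for a column whose name contains the table name (e.g. patient -> patient_id);
        //    otherwise simply take the first column.
        try (ResultSet cols = md.getColumns(null, schema, table, "%")) {
            String first = null;
            while (cols.next()) {
                String col = cols.getString("COLUMN_NAME");
                if (first == null) first = col;
                if (col.toLowerCase().contains(table.toLowerCase())) return List.of(col);
            }
            if (first != null) fields.add(first);
        }
        return fields;
    }
}
```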
Alternative to Predict Lookup Fields when using wildcard processing
As an alternative to enabling the Predict Lookup Fields option (which is not always accurate), you can specify a list of table=fields pairs in the Lookup Fields field. Use fully qualified table names and ';' as a separator between table=field pairs.
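For example, a value such as public.patient=patient_id;public.orders=order_id (hypothetical table and column names) assigns its own lookup field to each table matched by the wildcard.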
You can specify a condition that determines when the action (for example, MERGE) should be used.
Read how to configure a conditional action.
In Integrator, it is possible to create a flow that tracks changes in a data source (for example, a database or an API) and loads the changed records into the destination (for example, another database or API). Another name for change replication is data synchronization.
Integrator supports unidirectional change replication: from the source to the destination. To replicate changes in the opposite direction -- from the destination to the source -- a separate flow must be created, where the source and the destination swap places.
Read more about configuring a change replication in Integrator.
High Watermark Field
A field used as a high watermark for the change replication.
Enable Change Replication
Enable or disable change replication. This works together with High Watermark Field.
High Watermark Field Value on First Run
The optional value of the high watermark field used when the change replication scenario runs for the first time. If the high watermark field is a TIMESTAMP, use the YYYY-MM-DD HH:mm:SS.mil format, for example 2018-08-01 19:38:31.377. The value set for this field is used only the first time the change replication flow runs and is ignored on subsequent runs.
High Watermark Field Value
To make a change replication flow bulletproof, you can specify a SQL query which calculates the high watermark field value on the fly, based on the current state of the destination table. The flow will then pull only the records in the source which do not exist in the destination or have changed in the source since the last run of the flow.
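For example (hypothetical table and column names), a query along the lines of SELECT MAX(updated_at) FROM employee run against the destination would make each run extract only source records newer than what has already been loaded.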
The parameters in this group can be found under the On Exception tab.
By default, any error causes a halt of execution. When ignore is selected, the errors will be ignored and execution will continue. Read how to configure the flow to ignore all or specific exceptions.
You can specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the Ignore On Exception option is selected.
Rollback on Exception
Some databases (for example, PostgreSQL) require transaction rollback for all exceptions.