When to use this transformation
Use this transformation to extract data from the source, transform it and load it into the destination.
A specific ETL Flow or Any to Any ETL Flow can include multiple source-to-destination transformations. Transformations are typically executed in order, but can also be run out-of-order if parallel execution is enabled.
Add new source-to-destination transformation
To add a new source-to-destination transformation, select the Flow type which starts with Source to target (for example, web service to database), and click +.
You are required to select a Connection for the source and the destination, as well as select or enter a source name (FROM) and a destination name (TO). A Format is only required for transformations with data objects, such as web services or flat and structured files.
Read more about Mapping a source to a destination and Mapping source fields to destination fields.
Source (FROM) and Destination (TO)
A typical transformation includes a source (FROM) and a destination (TO). The source and destination can be any data source supported by Etlworks: tables in a relational database, SQL statements, files, web services, API endpoints, or social media websites.
- If the source or destination is a relational database, the name is the name of the table, view, or synonym. Wildcard object names are supported. When configuring a FROM with a wildcard, you must use a fully qualified object name: database.schema.wildcard.
- If the destination is a relational database and the FROM is configured with a wildcard, you can use * as a part of the destination name (TO). The system will substitute the * with the source name at run time. See the example after this list.
- If the source or destination is in file storage (server or cloud), the name is an actual file name or a wildcard file name.
- If the source or destination is a web service, the name is any appropriate name, for example, patient.json.
- If the source or destination is an email Connection, the name is either the name of the attached file or any appropriate name.
- If the source or destination is a key-value storage, such as Redis, the name is the key name.
- If the source or destination is a NoSQL database, such as MongoDB, the name is the document ID.
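For example (the names below are hypothetical), a wildcard source paired with a * in the destination name could look like this:
FROM: mydb.public.order_*
TO: staging_*
Assuming the substitution uses the unqualified source table name, the matching table mydb.public.order_2023 would be loaded into the destination table staging_order_2023.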
Programmatically set Destination Object Name (TO)
Read how to programmatically set a destination object name (TO) when the source is configured with a wildcard.
Selecting FROM and TO objects from the list
If the source and/or destination Connection is live, the system automatically populates the available objects (tables, files, endpoints) from that Connection. Use the metadata selector to select an object from the list.
Manage transformations
You can add new transformations, as well as reorder, disable or delete them. Disabled transformations will not be executed.
Most transformations include per-field Mapping and additional parameters.
Source Query
The Source query is used to create a source dataset for the source-to-destination transformation. The source query can be executed against any Connection, including relational databases, files, and web services.
To configure the source query, click MAPPING, and enter the SQL in the Source query field.
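As an illustration (the table and column names below are hypothetical), a source query against a relational database Connection is a regular SELECT statement:
select order_id, customer_id, amount, updated_at
from orders
where status = 'SHIPPED'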
Source query is a script
When the source Connection is a relational database that supports stored procedures and anonymous SQL blocks, it is possible to run SQL scripts (with variables, conditions, loops, etc.) as a part of the Source query. The only condition is that the script must return a cursor.
Enter the script in the Source query field and enable the Execute as script parameter.
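For example, assuming the source is a Microsoft SQL Server Connection and a hypothetical orders table, a script with a variable that still returns a cursor might look like this (with Execute as script enabled):
declare @cutoff datetime = dateadd(day, -7, getdate());
select order_id, customer_id, amount
from orders
where updated_at >= @cutoff;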
Destination Query
When the destination is a relational database, the Flow automatically creates and executes a destination query for each row in the source. The query can be INSERT, UPDATE, DELETE, or MERGE.
The developer can override the destination query.
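Conceptually (the table, columns, and syntax below are hypothetical, and the actual SQL depends on the destination database), the query generated for each source row with the MERGE action resembles the following, where ? is a bind variable:
merge into employee dst
using (select ? as employee_id, ? as first_name) src
on (dst.employee_id = src.employee_id)
when matched then update set first_name = src.first_name
when not matched then insert (employee_id, first_name) values (src.employee_id, src.first_name);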
Parameters
In Etlworks, you can customize your source-to-destination transformation by entering parameters.
To open the Parameters section while still on the MAPPING screen, select one of the following tabs:
- Parameters
- Additional Transformations
- Complex Transformations
- Change Replication
- On Exception
- Flow Control
The actual list of available parameters and high-level transformations is dynamic and depends on the type of Flow as well as the source and destination Connections.
Here's a quick list of all available parameters:
- Common parameters
- Parameters used to improve performance
- Parameters available only when processing source objects matching the wildcard name
- Parameters that control whether the transformation will be executed
- Parameters available only when the source is a file
- Parameters available only when the source is a database
- Parameters available only when the destination is a database
Common parameters
The transformation name
Each source-to-destination transformation has either an automatically generated or user-assigned name. In most cases, the name is generated automatically using the following algorithm: FROM + " TO " + TO + INDEX. For example, if the FROM (source) is employee.json and the TO (destination) is employee.csv, and the index of the transformation is 1, the transformation name will be employee.json TO employee.csv 1.
If you override the transformation name under MAPPING > Parameters, this name will be used instead.
If you are setting the transformation name manually, make sure the name is unique within the Flow. Otherwise, only the last transformation with this name will be executed.
Linked Source Name
By default, Etlworks creates pairs of sources and destinations (FROM - TO). Each source is linked to a single destination. It is possible to link a single source to multiple destinations by specifying this property. For example, let's assume that there is a source (FROM) employee.json. You can link it to multiple destinations (TO) by creating FROM - TO transformation pairs and setting the Linked Source Name to employee.json for each of these pairs. The FROM part of each pair will be ignored and set to employee.json.
Parameters used to improve performance
Stream data
When this option is enabled (it is enabled by default), Etlworks streams data from the source to the destination. When streaming is enabled, Etlworks never reads the whole data set (table, view, etc.); instead, it reads data from the source in chunks and immediately writes data to the destination. In general, streaming is much faster and saves a lot of memory, so it is enabled by default. However, streaming is not always possible, so Etlworks makes every attempt to detect situations where streaming won't work and automatically disables it. In rare cases, when Etlworks cannot automatically disable streaming, you might need to do this manually, hence the streaming option.
Use bind variables
The fastest way to execute SQL queries is to use bind variables. When multiple identical SQL statements are executed in batches and the only difference between them is the actual field values, a lot of run time is saved because the SQL statement is prepared (compiled) only once. Using bind variables also automatically enables streaming and batch processing, and it is much faster when SQL statements are processed in batches. Etlworks automatically sets the batch window to 100, but it is also possible to set the batch window manually using the Batch Window field. Bind variables (enabled by default) are subject to the same restrictions as streaming, so the same rules apply.
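For example (the table is hypothetical), with bind variables the engine prepares a single parameterized statement once and then executes it in batches, binding a new set of values for each row:
insert into employee (employee_id, first_name, last_name) values (?, ?, ?)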
Fetch size
Our ETL engine is written in Java and uses the JDBC protocol to connect to databases. In JDBC, the setFetchSize(int) method is very important for performance and memory management within the JVM, as it controls the number of network calls from the JVM to the database as well as the amount of RAM used for ResultSet processing. Many JDBC drivers (for example, Oracle) enforce a Fetch Size by default, while some others (for example, MySQL and Redshift) do not. The Fetch Size field can be used to set the maximum number of records retrieved in one database call, thus limiting the memory consumption within the JVM.
Setting Fetch Size at the transformation level overrides the Fetch Size set at the Connection level.
Max records
By default, our ETL engine extracts all the records from the source, limited only by the Source query or filter conditions. You can use this field to set an upper limit on the total number of records that can be extracted from the source.
Setting Max Records at the transformation level overrides the Max Records set at the Connection level.
Parallel
When this option is enabled (the default), the particular transformation will be executed in a parallel thread. From a practical standpoint, this means that Etlworks can run multiple transformations in parallel, which is generally much faster. However, when certain transformations depend on others (for example, when the destination table items depends on the destination table order), you might want to disable parallel execution.
Wait for completion
Enable this option to set a synchronization point that allows the Flow to wait for the completion of the parallel transformations before proceeding to the next step. There can only be one synchronization point per destination Connection.
Force Streaming
In many cases, when the source is a file or a response from a web service and the destination is a database table, streaming (and the ability to use bind variables) will be automatically disabled by the engine, even when it is enabled for the specific transformation. For example, this happens if the source is an XML or JSON document, or if SQL queries are used to filter out records in the response from the web service call.
In Etlworks, it is possible to force streaming, even when it is not normally allowed, by enabling the Force Streaming option. When streaming is forced, the system automatically creates a streamable view from a non-streamable source. It still keeps all the records in memory, but it does not keep all the SQL statements in memory, and it is possible to use bind variables with a streamable view.
Parameters available only when processing source objects matching the wildcard name
Maximum Number of Files to Process
When processing the files matching the wildcard in the source (FROM), you can use this field to limit the number of files to process. This can be useful if you expect a large number of files (many thousands) to be processed and don't want the Flow to be overwhelmed. Processing a smaller number of files (hundreds) in micro-batches will be more efficient.
Exclude Objects
If the source (FROM) is configured with a wildcard, you can use this field to exclude specific database objects or files. Use a comma (,) to exclude multiple objects. Wildcard object names are supported in the exclusion list.
Note that you must use the fully qualified database object name in the exclusion list.
Example:
If the source is public.*, the system will include all tables and views in the schema public. If you want to exclude tables and views which start with the prefix patient or account, you can configure the following exclusion list: public.patient*,public.account*.
To exclude all views, enter all views in the Exclude Objects field.
To exclude all tables, enter all tables in the Exclude Objects field.
Include Objects
If the source (FROM) is configured with a wildcard, you can use this field to include only specific database objects or files. Use a comma (,) to include multiple objects. Wildcard object names are supported in the inclusion list.
Note that you must use the fully qualified database object name in the inclusion list.
Example:
If the source is public.*, the system will include all tables and views in the schema public. If you want to include only tables and views which start with the prefix patient or account, you can configure the following inclusion list: public.patient*,public.account*.
Use parallel threads when processing sources by a wildcard
If the transformation is configured with a wildcard source name, this option controls whether a parallel thread is used for each source that matches the wildcard name.
Load data into the same destination sequentially
If the transformation is configured with a wildcard source name and the use of parallel threads for wildcard sources is enabled, this option controls whether to load data into the same destination (for example, a database table) sequentially or in parallel. Loading data sequentially makes sense when you need to MERGE data in the same order it was updated in the source.
Enable this option if:
- You need to load files matching a wildcard into the database using the MERGE (UPSERT) action.
- You are loading files matching a wildcard into Snowflake or Redshift using the MERGE action.
- You are loading files matching a wildcard using the CDC MERGE action.
Parameters that control whether the transformation will be executed
Ignore when there are no records
If this parameter is enabled and the source data set has no records, the entire transformation will be ignored. For example, the file will not be loaded into the database, the API call will not be made, etc.
Parameters available only when the source is a file
Process all files
If this parameter is enabled and there are multiple files matching a wildcard file name, all matching files will be processed (extracted and loaded).
Ignore when there is no file
If this parameter is enabled and there are no source files, the entire transformation will be ignored. For example, the file will not be loaded into the database, the API call will not be made, etc.
Delete loaded source files
If this parameter is enabled, the system will delete the source file after it has been loaded.
If this parameter is enabled and an error occurs during the load, the system will delete the source file.
Parameters available only when the source is a database
Create new database Connection for each source
If this option is enabled, a new database Connection will be created for each source.
Enabling this option can improve performance when the database driver does not support parallel extract from different sources using the same database Connection.
Parameters available only when the destination is a database
Create new database Connection for destination
If this option is enabled, a new physical database Connection will be created for each destination, even when the same destination Connection is configured for each pair. This includes loading data into multiple tables using a wildcard source name.
Enabling this option can improve performance and resolve issues with the handling of transactions.
Create target table if it doesn't exist. Also update columns to match the target
If this parameter is enabled (the default) and the destination table does not exist, Etlworks will automatically create it. Also, when this option is enabled, the INSERT, UPDATE, and MERGE statements will only include fields that exist in the destination table, so it is recommended that you keep it enabled.
Create temporary table
If this parameter is enabled and the Create target table parameter is also enabled, the Flow will create a temporary destination table (create temporary table xyz). The target database must support temporary tables. Temp tables can be used to stage data before moving it to the actual tables.
Create table SQL
By default, the system automatically generates the SQL for creating a new table if it does not exist. You can override the default SQL by providing a template, for example:
create global temporary table {TABLE} ({COLUMNS})
Available tokens: {TABLE} and {COLUMNS}.
You can also write a program in JavaScript to dynamically generate the create table SQL, for example:
value = 'create global temporary table ' + table + '(' + columns + ', primary key id)';
Available variables:
- etlConfig
- scenario
- dataSet
- destination
- table: the table name
- columns: the comma-separated list of columns
Create indexes
If this parameter is enabled (disabled by default) and the source table has indexes, those same indexes will be created for the destination table.
Alter target table if the source has columns that the target table doesn't have
If this parameter is enabled (it is disabled by default) and the Create target table parameter is also enabled, the system will automatically create and execute an ALTER TABLE SQL statement for each column that exists in the source and does not exist in the destination.
Recreate target table if the source has columns that the target table doesn't have
If this parameter is enabled (it is disabled by default) and the Create target table parameter is also enabled, the system will automatically drop and recreate the destination table (if it exists) when the source table has changed.
This parameter is not available if High Watermark Change Replication is enabled.
Before SQL
The SQL to be executed on the destination Connection before the extract. Optionally, you can configure the Flow to Ignore Errors that occur when executing the Before SQL.
After SQL
The SQL to be executed on the destination Connection after the load. Optionally, you can configure the Flow to Ignore Errors that occur when executing the After SQL.
Use native SQL
If this option is enabled (it is disabled by default), the native SQL dialect of the destination Connection (for example, PL/SQL for Oracle) will be used. Using native SQL is generally slower because it automatically disables the ability to use bind variables.
SQL action (available only if the destination is a database)
Action
By default, Etlworks INSERTs records into the destination table. You can change this to UPDATE, DELETE, MERGE (UPSERT), IfExist, or Conditional.
MERGE is natively supported for Oracle, MS SQL Server, PostgreSQL, MySQL, DB2, Informix, and Sybase databases.
The alternative to MERGE is the IfExist action: Etlworks will check to see if the record already exists; if it does, it will execute an UPDATE; otherwise, it executes an INSERT. The Action can be Conditional, which requires entering JavaScript conditions into the Action Conditions fields.
Read more about the MERGE (UPSERT) and IfExist actions.
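Conceptually (the table and key column below are hypothetical), the IfExist action performs a lookup followed by either an UPDATE or an INSERT for each source row, where ? is a bind variable:
select count(*) from employee where employee_id = ?
-- if the record exists, the Flow executes an UPDATE:
update employee set first_name = ? where employee_id = ?
-- otherwise, it executes an INSERT:
insert into employee (employee_id, first_name) values (?, ?)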
Lookup Fields
Allows you to enter a comma-separated list of the fields in the destination table which will be used to uniquely identify a record for the UPDATE, DELETE, MERGE, or IfExist action. Note that field names are case sensitive.
If the action is MERGE and the database is PostgreSQL or MySQL, there must be a unique index which includes all Lookup Fields. For other databases, the index is optional but enhances performance.
Predict Lookup Fields
If this option is selected and the Lookup Fields text field is empty, the system will try to predict the column(s) which uniquely identify the record.
The system uses the following algorithm to predict the Lookup Fields:
- If the destination is a database, the specific database supports indexes (not all databases do), and the destination database object exists, the system will try to get either the Primary Key or the first suitable Unique Index from the destination object and use the included column(s) as the Lookup Fields.
- If the above fails, the system will try to extract the Lookup Fields from the source, using exactly the same approach, except this time it will use the source Connection and the source database object name.
- If that does not work either, the system will look for columns whose names contain part of the destination table name. For example, if the table name is patient, the system will look for columns like patient_id, patientid, etc.
- Finally, if all of the above fail, the system will simply use the first column in the destination database object as the Lookup Fields.
It is important to understand that this approach does not guarantee that the correct column(s) will be selected as the value for the Lookup Fields. However, it can be the only solution when processing tables by a wildcard name, simply because there is just one source-to-destination transformation, which must adapt itself to the different pairs of source/destination objects.
Alternative to Predict Lookup Fields when using wildcard processing
As an alternative to enabling the Predict Lookup Fields option (which is not always accurate), you can specify a list of table=field pairs in the Lookup Fields field. Use fully qualified table names and ; as a separator between table=field pairs.
Example:
test1.inventory=inventory_id,database_name;
test1.payment=payment_id,database_name;
test1.rental=rental_id,database_name;
Action Conditions
When the Conditional action is selected, use this field to enter a JavaScript expression that identifies which action (INSERT, UPDATE, DELETE, or MERGE) should be used.
Read how to configure a conditional action.
Change replication
In Etlworks, it is possible to create a Flow that tracks changes in a data source (for example, a database or API) and loads the changed records into the destination, for example, another database or API. Another name for change replication is data synchronization.
Etlworks supports unidirectional change replication: from the source to the destination. To replicate changes in the opposite direction, from the destination to the source, a separate Flow must be created, in which the source and the destination swap places.
Read more about configuring a change replication in Etlworks.
High Watermark Field
A field used as a high watermark for the change replication.
Enable Change Replication
Enable or disable change replication. This works together with the High Watermark Field.
High Watermark Field Value on First Run
The optional value of the High Watermark Field when running the change replication scenario for the first time. If the High Watermark Field is a TIMESTAMP, use the YYYY-MM-DD HH:mm:SS.mil format, for example, 2018-08-01 19:38:31.377. The value set for this field will be used only when the change replication Flow runs for the first time and will be ignored on subsequent runs.
High Watermark Field Value
To make a change replication Flow bulletproof, you can specify an SQL query which calculates the High Watermark Field Value on the fly, based on the current state of the destination table. Basically, the Flow will pull only the records in the source which do not exist in the destination or have changed in the source since the last run of the Flow.
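A minimal sketch, assuming a hypothetical destination table employee_copy and a TIMESTAMP high watermark column updated_at; the query returns the latest value already loaded into the destination:
select max(updated_at) from employee_copy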
Handling exceptions
The parameters in this group can be found under the On Exception tab.
On Exception
By default, any error halts the execution. When Ignore is selected, errors will be ignored and execution will continue. Read how to configure the Flow to ignore all or specific exceptions.
Exception Mask
You can specify which errors should be ignored while still halting the execution for all other errors. Enter part or all of the exception string. This field works only when Ignore is selected for On Exception.
Rollback on Exception
Some databases (for example, PostgreSQL) require transaction rollback for all exceptions.