Version
New
1. Asynchronous POST/PUT listener requests throttling and queuing
Affected area: user-defined APIs
User request: Implement priority queue and throttling for user-defined APIs (Listeners)
When you create user-defined APIs that allow sending data to your Flows (PUSH API), it is not always possible to control how often and how much data your clients will send. To prevent system degradation, or a complete overload in case of a large number of calls with huge payloads in a short period of time, it is possible to configure throttling and queuing for asynchronous POST and PUT requests. By utilizing throttling and queuing, you can specify how many calls should be processed per minute, whether excess calls should be queued, and in which order queued messages should be processed.
Read more about request throttling and queuing.
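For illustration only, here is a minimal sketch in plain JavaScript (not Etlworks configuration or code) of what throttling with a queue amounts to: at most a fixed number of calls are processed per minute, and excess calls wait in a queue until the next window.
// Conceptual sketch of throttling with a queue (plain JavaScript, not Etlworks code).
var limitPerMinute = 60;              // how many calls may be processed per minute
var processedThisMinute = 0;
var queue = [];                       // excess calls wait here
function deliver(request) { /* hand the payload to the Flow */ }
function handleRequest(request) {
  if (processedThisMinute < limitPerMinute) {
    processedThisMinute++;
    deliver(request);                 // within the limit: process immediately
  } else {
    queue.push(request);              // over the limit: queue for later
  }
}
setInterval(function () {             // a new one-minute window starts
  processedThisMinute = 0;
  while (queue.length > 0 && processedThisMinute < limitPerMinute) {
    handleRequest(queue.shift());
  }
}, 60 * 1000);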
2. Added ability to monitor currently executed scripts and tasks
When Etlworks Integrator executes SQL scripts, JavaScript and Python programs, extracts and loads data, performs transformations, or operates on files, it records the essential parameters of the running tasks: the start time, the duration, the script being executed, etc. The currently running tasks can be monitored in real time.
Note that it is different from the Flow's log:
- You can see the actual script that is being executed.
- It is easier to navigate than the Flow's log because (unlike entries in the log) the tasks are ordered by the start time and duration.
Read more about monitoring the currently executed scripts and tasks.
3. Improvements for log-based change replication (CDC)
3.1 Upgraded Debezium to 1.4.2.final
Affected areas: CDC Flows, CDC Connections
We upgraded the change data capture engine (Debezium) to the latest stable release, 1.4.2.final. Previously, Etlworks Integrator was using the pre-release Debezium 1.4.cr1.
3.2 Improved startup time of the CDC engine
Affected areas: CDC Flows, CDC Connections
Prior to this release, all CDC connectors (except MySQL and MongoDB) were reading metadata for all tables in the monitored databases/schemas regardless of the inclusion/exclusion list. Essentially, before the snapshot and streaming, we refresh the schema from the database by reading every table and its columns and only then filter that data based on the include/exclude lists. On large production databases, it could take up to half an hour of constant querying to load the required data.
In this update, we optimized the startup sequence, so now the CDC connectors apply a filter (inclusion/exclusion list) before reading the metadata.
3.3 Added support for Transaction Markers
Affected areas: CDC Flows, CDC Connections
Consider, for instance, an e-commerce application that manages purchase orders. In a relational database, the information representing such an order will typically be spread across multiple tables, e.g., PURCHASE_ORDER, ORDER_LINE, and SHIPMENT_ADDRESS. To examine the complete state of a purchase order in the database itself, you'd run a JOIN query against the three tables.
Things get more challenging when looking at the change events produced by CDC connectors: each table's changes arrive as separate events, and out of the box there is no indication of which events belong to the same transaction.
After this update, the situation can be addressed by leveraging the new transaction metadata supported by most CDC connectors. When enabled, separate transaction markers will be created at the beginning and end of each transaction.
Read how to enable transaction markers.
The transaction markers are not available for the MongoDB CDC connector. The MySQL CDC connector requires GTIDs to be enabled.
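For illustration, Debezium's transaction markers typically have the shape sketched below (shown here as JavaScript object literals; the exact fields can vary by connector and version):
// Sketch of the markers emitted at the boundaries of a transaction (field names follow
// Debezium's transaction metadata; treat the exact content as an approximation).
var beginMarker = {
  status: 'BEGIN',
  id: '571',                          // transaction identifier shared by all events in the transaction
  event_count: null,
  data_collections: null
};
var endMarker = {
  status: 'END',
  id: '571',
  event_count: 2,                     // total number of change events in the transaction
  data_collections: [
    { data_collection: 'public.purchase_order', event_count: 1 },
    { data_collection: 'public.order_line', event_count: 1 }
  ]
};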
3.4 Added ability to configure the CDC stream to never stop
Affected areas: CDC Flows, CDC Connections
It was always possible to configure the CDC stream to stop automatically if there are no new CDC events.
It is now possible to configure it to run forever: if none of the parameters in the group that controls automatic stopping are set, the CDC Flow will run until it is stopped manually.
To stop the CDC Flow manually, click Stop / Cancel.
3.5 Added ability to set the maximum size of the CSV file created by the CDC connector
Affected area: CDC Connections
In this update, we added the ability to set the maximum size of the CSV file created by the connector. It works together with the other parameters in the same group.
3.6 Added ability to programmatically modify CDC event and CDC key
Affected areas: CDC Flows, CDC Connections
The CDC event represents a row in the CDC stream. The CDC key represents the unique identifier of the group of events. All events with the same CDC key are stored in the same file.
Read how to use the preprocessor to modify the event, the key, and filter out the specific events altogether.
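A hypothetical sketch of such a preprocessor is shown below; the variable names (event, key) and their fields are assumptions for illustration only, not the documented Etlworks API, so refer to the linked documentation for the real contract.
// Hypothetical preprocessor sketch; variable names and fields are assumptions.
var event = { table: 'ORDER_LINE', op: 'u' };    // the CDC event handed to the preprocessor
var key = 'ORDER_LINE';                          // the CDC key (controls which file the event goes to)
if (event.table === 'AUDIT_LOG') {
  event = null;                                  // drop events for tables we do not want to replicate
} else if (event.table === 'ORDER_LINE') {
  key = 'PURCHASE_ORDER';                        // group ORDER_LINE events with PURCHASE_ORDER events
}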
3.7 Added ability to convert geometry data encoded in WKB Format to JSON
Affected area: CDC Connections
By default, the geometry data extracted by the CDC connector is encoded in the Well-Known Binary (WKB) format.
In this update, we added the ability to automatically convert WKB to human-readable JSON. This is especially useful when streaming data from Postgres tables with geometry columns; other databases support geometry data types as well, so the feature is not limited to Postgres.
3.8 Added ability to ignore DELETE CDC events when loading data into the relational database
Affected area: loading CDC events from CSV files when the destination is a relational database
If the SQL action is set to CDC MERGE, it is now possible to configure the load Flow to skip DELETE events by enabling the property Do not execute DELETE when Action is Record with MERGE.
It greatly improves the performance of the MERGE because it automatically enables batch processing, which is otherwise disabled for the CDC MERGE action.
3.9 Added ability to stream CDC events from the specific binlog file and position (MySQL)
Affected areas: CDC Flows, MySQL CDC connector
It is now possible to configure the MySQL CDC connector to start streaming from a specific binlog file and position.
It is only available when the connector is switching from the snapshot to the streaming mode, and the snapshot mode is set to initial or schema_only.
3.10 Added streaming output plugins for Postgres CDC connector
Affected areas: CDC Flows, Postgres CDC connector
If you are using a wal2json plug-in and transactions are very large, the JSON batch event that contains all transaction changes might not fit into the hard-coded memory buffer, which has a size of 1 GB. In such cases, switch to a streaming plug-in by setting the Output Plugin parameter to wal2json_streaming or wal2json_rds_streaming.
Read more about available output plugins.
4. New and improved transformations
4.1 Added ability to use inline SQL in Mapping
Affected areas: source-to-destination transformation when the destination is a relational database, mapping
In Etlworks Integrator, it was always possible to use JavaScript and Python to calculate the column's value in the mapping.
In this update, we added the ability to use SQL for calculating the column's value.
Pushing the calculation logic down to the destination database can greatly improve the performance of the data ingestion pipeline compared to using JavaScript or Python because there is no context switch between SQL and another scripting language.
Read more about using inline SQL in Mapping.
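As a simple illustration (the column names are hypothetical), here is the same derived column expressed as a per-row JavaScript expression and as an inline SQL expression evaluated by the destination database:
// Per-row JavaScript calculation, evaluated by the ETL engine:
var firstName = 'John', lastName = 'Doe';        // hypothetical source column values
value = firstName + ' ' + lastName;              // "John Doe"
// The equivalent inline SQL expression, evaluated by the destination database instead:
//   first_name || ' ' || last_name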
4.2 Added ability to programmatically modify Source Query and set variables used in the query
Affected areas: source-to-destination transformation
In this update, we added the ability to programmatically modify the Source query and to set the Flow variables used as parameters in the source query.
The most common use case is when the source query has {tokens} that you want to substitute with the values of the Flow variables at runtime.
Read more about modifying the Source query and setting variables used in the query.
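As a conceptual sketch (not Etlworks internals), the runtime substitution works roughly like this: each {token} in the source query is replaced with the value of the Flow variable of the same name.
// Conceptual sketch of token substitution; the real work is done by the Etlworks engine.
var sourceQuery = "select * from orders where updated_at > '{START_DATE}'";
var flowVariables = { START_DATE: '2021-03-01' };          // set programmatically before the query runs
var resolvedQuery = sourceQuery.replace(/\{(\w+)\}/g, function (match, name) {
  return flowVariables[name] !== undefined ? flowVariables[name] : match;
});
// resolvedQuery: select * from orders where updated_at > '2021-03-01'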
4.3 Added ability to modify the Create table SQL
Affected areas: source-to-destination transformation
By default, the system automatically generates the SQL for creating a new table if it does not exist.
In this update, we added the ability to control how the destination table is getting created.
There are two new properties available under MAPPING > Parameters:
4.4 Create temporary table
If this parameter is enabled and the Create target table parameter is also enabled, the Flow will create a temporary destination table (create temporary table xyz). The target database must support temporary tables. Temporary tables can be used to stage data before moving it to the actual tables.
4.5 Create table SQL
You can override the default SQL by providing a template, for example:
create global temporary table {TABLE} ({COLUMNS})
Available tokens: {TABLE} and {COLUMNS}.
You can also write a program in JavaScript to dynamically generate the Create table SQL, for example:
value = 'create global temporary table ' + table + ' (' + columns + ', primary key (id))';
Available variables:
- etlConfig
- scenario
- dataSet
- destination
- table: the table name
- columns: the comma-separated list of columns
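A slightly fuller sketch using the variables documented above (the DDL and the staging naming convention are illustrative; adjust them to your destination database):
// Generate the Create table SQL from the table and columns variables.
var ddl = 'create table ' + table + ' (' + columns + ')';
// Hypothetical convention: tables prefixed with stage_ are created as temporary tables.
if (table.toLowerCase().indexOf('stage_') === 0) {
  ddl = 'create global temporary table ' + table + ' (' + columns + ')';
}
value = ddl;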
4.6 Added ability to create a new destination Connection for each source-to-destination pair
Affected area: source-to-destination transformation
Prior to this update, Etlworks Integrator Flows where the destination is a database reused the same destination Connection for every source-to-destination pair.
In this update, we added the ability to dynamically create a new physical database Connection for each pair, even when the same destination Connection is configured. This includes loading data into multiple tables using a wildcard source name.
Enabling this option can improve performance.
The new parameter can be found under transformation > MAPPING > Parameters. It is disabled by default.
4.7 Added Preprocessor for HTTP Connection
Affected area: HTTP connector
It is now possible to programmatically change the parameters of the HTTP Connection using a program in JavaScript. This program can be entered in the Preprocessor field available for the HTTP Connection.
The most common use case is when you need to digitally sign the HTTP request before sending it to the server, which is quite typical for SOAP-based APIs or for APIs that require a complicated authentication protocol.
Read more about the HTTP Connection Preprocessor.
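For illustration, a minimal HMAC-based request signature might be computed as sketched below; this is plain Node.js-style JavaScript, not the objects actually exposed by the Etlworks HTTP Connection Preprocessor, and the header name and signing scheme are hypothetical.
// Conceptual sketch of signing a request body (names and scheme are hypothetical).
var crypto = require('crypto');                  // Node.js crypto module, for illustration
function signRequest(method, path, body, secret) {
  var payload = method + '\n' + path + '\n' + body;
  return crypto.createHmac('sha256', secret).update(payload).digest('base64');
}
var signature = signRequest('POST', '/api/v1/orders', '{"id": 1}', 'my-shared-secret');
// The computed value would then be added to the request, e.g. as an X-Signature header.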
5. Working with files
5.1 Added ability to cd to the specific directory in the SFTP server
Affected area: SFTP Connection
In this update, we added a new parameter to the SFTP connector: Change current remote directory to.
If this parameter is set, the connector will cd to this specific directory after connecting to the SFTP site. It is useful if you want to access a folder located a level or two above the home folder.
5.2 Added ability to group rows in the Excel XLSX worksheet
Affected area: Excel XLSX Format
It is now possible to automatically group rows when creating an Excel XLSX worksheet. Read more about outlining (grouping) data in a worksheet.
To enable grouping, simply set the first and [optionally] the last row to include in a group for the Excel XLSX Format.
5.3 Added ability to hide parent node in the nested XML array
Affected area: XML Format
By default, the XML connector creates a parent node for nested XML arrays. The name of the parent node is the same as the name of the first node in the array.
<data>
  <prescriber>Mr Smith</prescriber>
  <patient>
    <patient>John Doe</patient>
    <patient>Jane Doe</patient>
  </patient>
</data>
If this option is enabled, the connector will not create the parent node.
<data>
  <prescriber>Mr Smith</prescriber>
  <patient>John Doe</patient>
  <patient>Jane Doe</patient>
</data>
5.4 Added ability to create CSV files with the specific order of columns
Affected areas: CSV Format, Mapping
It is now possible to create CSV files with the specific order of columns. Simply enable the option Reorder columns based on the order of columns in mapping for the destination CSV Format and configure the order of fields in Mapping.
6. Automatic PGP encryption
Affected areas: source to destination transformation, server storage Connection
In this update, we added the ability to automatically encrypt files created in the server storage using the PGP algorithm and an uploaded Public Key.
When configuring the destination (TO) server storage Connection, simply upload the Public Key File, and the transformation will handle the rest. If the key is available, all files created by the transformation will be automatically encrypted using the PGP algorithm.
7. Improvements for automatic data type Mapping
The automatic data type Mapping mostly affects creating tables (when the table does not exist) in the destination database. The engine infers the data type of a column in the destination table based on the data type of the column in the source and the rules set for the specific ETL driver. We tweak the rules from time to time to make sure they provide the best possible match.
7.1 Some database connectors now support defaulting data types to VARCHAR
Affected areas: database connectors, premium connectors
User request: CData Smartsheet data type conversion
When configuring a database Connection which is not one of the following: SQL Server, DB2, MySQL, PostgreSQL, Informix, Oracle, Sybase ACE, it is now possible to default the data type of all columns in the table that will be automatically created to VARCHAR. The size of the VARCHAR column will be set to the maximum available for the specific database.
7.2 PostgreSQL NUMERIC with no precision is now supported
All versions of the PostgreSQL database support the NUMERIC data type with no precision. The previous behavior was to convert a column with a NUMERIC data type to NUMERIC(max_allowed_precision).
In this release, we correctly identify NUMERIC columns with no precision and preserve the data type.
7.3 Redshift BPCHAR and FLOAT with no precision are now supported
The Redshift JDBC driver reports BPCHAR columns as BPCHAR(255) and FLOAT as FLOAT8(17,17), most likely because the driver is based on the Postgres JDBC driver.
After this update, our ETL engine is able to correctly recognize the actual data type and preserve BPCHAR and FLOAT columns with no precision.
8. Improvements for Snowflake, Redshift, and Synapse Analytics Flows
8.1 Added ability to reorder CSV columns during load into Snowflake table
Affected area: loading data into Snowflake
The Snowflake COPY INTO <table> command requires that the order of columns in the staged CSV file match the order of columns in the destination table. Our Snowflake connector has always been able to reorder the columns during the extract, but that approach requires reading the entire staged file and recreating it with a new order of columns before executing the COPY INTO command.
Snowflake also supports transforming data while loading it into a table using the COPY INTO ... FROM (SELECT ...) syntax, which dramatically simplifies the ETL pipeline for basic transformations and improves performance when reordering columns during a data load.
In this update, we added the ability to configure Flow to automatically reorder the columns in the staged CSV file during the load to match the order of columns in the destination table.
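For reference, here is a hedged illustration of the kind of statement Snowflake supports for reordering columns during the load (the stage, table, and column names are hypothetical); with the new option enabled, the Flow generates the equivalent reordering for you:
// Illustrative COPY statement built as a string (JavaScript is used only to stay in the
// document's scripting language); $1..$3 refer to column positions in the staged CSV file.
var copySql =
  'copy into orders (order_id, customer_id, amount) ' +
  'from (select t.$1, t.$3, t.$2 from @csv_stage t) ' +
  "file_format = (type = 'CSV' skip_header = 1)";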
8.2 Added ability to use global variables in Lookup Fields
Affected areas: ETL into Snowflake, ETL into Redshift, ETL into Synapse Analytics
Lookup Fields is a comma-separated list of columns that uniquely identify the record in the destination table. If Predict Lookup Fields is disabled, this parameter is required for merging the data in the source with the data in the destination table.
In this update, we added the ability to use {global variables} in the Lookup Fields.
9. Added ability to suspend Flow executions
In this release, we added the ability to temporarily suspend the execution of all Flows. All Schedules remain active, but the Flows won't start, even if triggered manually or by an API call.
Self-hosted customers can use it prior to installing the latest update to make sure that no Flows are running during the upgrade process.
The do not run flag can be set and cleared from the UI or by calling the suspend and resume API endpoints.
The flag is automatically cleared after the restart.
On a multi-node setup, the flag is set and cleared for all instances in the cluster.
9.1 Suspending Flows from UI
To set the flag, go to Settings > Flows and click Suspend flow executions.
Optionally enable Stop currently running flows and click Suspend.
The label suspended will appear in the header.
9.2 Resuming Flows from UI
To clear the flag manually, click Resume flow executions.
9.3 Suspending Flows by calling an API endpoint
Read about suspend API.
9.4 Resuming Flows by calling an API endpoint
Read about resume API.
10. Other improvements
10.1 Reduced wait time between Flow runs to 1 second
Affected area: scheduling continuously running Flows
In this update, we have reduced the minimum wait time between Flow runs from 30 seconds to 1 second.
For all intents and purposes, it is equivalent to running the Flow in real time.
10.2 Added ability to get the last Flow execution with the specified status using API
Affected area: GET Flow status API
In this update, we added a new query parameter to the GET Flow status API: status=string.
Example: https://app.etlworks.com/rest/v1/executions/1234?status=success.
Use this parameter to get the last Flow execution with the specified status.
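A minimal sketch of calling the endpoint from JavaScript (authentication is simplified; substitute whatever credentials your Etlworks account actually requires):
// Fetch the last successful execution of Flow 1234 (URL and parameter from the example above).
fetch('https://app.etlworks.com/rest/v1/executions/1234?status=success', {
  headers: { Authorization: 'Bearer <access token>' }   // assumption: token-based auth
})
  .then(function (response) { return response.json(); })
  .then(function (execution) { console.log(execution); });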
11. Usability and UX improvements
11.1 Improved filters in Etlworks Explorer
Affected area: Etlworks Explorer
It is now easier than before to filter objects and columns in Etlworks Explorer.
11.2 Canceling requests to retrieve objects and columns
Affected area: source to destination transformation
It is now possible to cancel the long-running request to populate the objects and columns available for the specific source (FROM) and destination (TO) Connection.
11.3 Dark mode in Flow log
Affected area: Flow log
In this release, we added the ability to switch to dark mode in the Flow log.