Continuous delivery model
Starting in November 2021, we are switching to the continuous delivery model. With this model, bug fixes, new features, and enhancements are released as soon as they are ready. Updates are automatically deployed to individual Etlworks instances on a rolling schedule. Read the full announcement.
What's New?
Connectors
In this release, we have added two new connectors:
We also added the ability to connect to the sandbox account when using Salesforce with the OAuth2 connector.
Streaming from message queues
In this release, we have added the ability to stream real-time data from Kafka and Azure Event Hubs to practically any destination. Before this update, Etlworks only supported extracting data from queues in micro-batches.
New functionality
We significantly improved the streaming of the CDC events from the message queues to any supported destination:
- Streaming CDC events that were ingested by Etlworks CDC connector.
- Streaming CDC events that were ingested by standalone Debezium.
We have added preprocessors to Kafka and Azure Event Hubs connectors:
- Consumer preprocessor - use this preprocessor to change the message streamed from the queue.
- Producer preprocessor - use this preprocessor to modify the message added to the topic.
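As a rough illustration, a consumer preprocessor is just a small script that receives a message and hands back a (possibly modified) payload before the rest of the flow sees it. The sketch below is conceptual JavaScript; the `message` variable and the way the result is returned are assumptions for illustration, not the documented Etlworks API.

```javascript
// Conceptual consumer-preprocessor sketch ('message' is an assumed variable
// holding the raw text streamed from the queue).
var payload = JSON.parse(message);
payload.received_at = new Date().toISOString();   // enrich the record
if (payload.email) {
    payload.email = payload.email.toLowerCase();  // normalize a field
}
message = JSON.stringify(payload);                // hand the modified message back
```

A producer preprocessor would follow the same pattern, modifying the message right before it is published to the topic.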
We have added a Postprocessor to the HTTP connector. The Postprocessor can be used to change the response content programmatically.
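For example, a postprocessor could reshape a verbose API response into just the part the flow needs. The snippet below is a minimal sketch; the `response` variable is an assumption used for illustration, not necessarily the exact object the connector exposes.

```javascript
// Minimal postprocessor sketch ('response' is assumed to hold the raw HTTP
// response body): keep only the array the flow actually consumes.
var body = JSON.parse(response);
var items = (body.data && body.data.items) ? body.data.items : [];
response = JSON.stringify(items);
```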
We have added a new configuration option to the CSV format, which allows reading CSV files with non-standard BOM characters.
We have added an option to automatically add an Excel worksheet name to the file name in Explorer and Mapping. It simplifies working with Excel files which include multiple worksheets. Read more.
We have added an option that allows the Excel connector to read data from worksheets that don't have a formal "columns" row.
We have added the ability to start the flow as a Daemon. Read more.
Important bug fixes and improvements under the hood
We improved the performance of the S3 SDK connector when reading the list of files available in the bucket.
We have fixed an edge case where the CDC binlog reader was not disconnecting on error.
We have fixed an issue with using bind variables for columns with the UUID data type.
Self-managed on-premises installers
Etlworks is a cloud-native application that works perfectly well when installed on-premises. Prior to this update, you would need to contact Etlworks support to receive a link to download an installer and a unique license generated for your organization.
In this update, we introduced a self-managed web flow that allows you to create a company account and download a fully automated installer for Linux and/or Windows. The installer includes a unique license generated for your organization. You can use the same installer to upgrade Etlworks Integrator to the latest version.
Supported operating systems are Amazon Linux 2, Ubuntu 18.04, Ubuntu 20.04, CentOS 7, Red Hat 7, Red Hat 9, Windows Server (2012-2022), and all editions of Windows 10 and Windows 11.
Windows installer
It was always possible to run Etlworks on Windows, but unlike on Linux, it required manually installing all of the components Etlworks needs, such as Java, Tomcat, Postgres, and Redis. In this release, we have added official support for all modern server and desktop versions of Windows.
- Install Etlworks Integrator on Windows.
- Automatically update Etlworks Integrator installed on Windows.
Connectors
In this release, we have added eleven new connectors:
- Clickhouse.
- Excel as database (premium).
- Microsoft Dynamics 365 (premium). This connector supports the following Dynamics editions: Sales, Customer Service, Field Service, Fin Ops, Human Resources, Marketing, Project Operations.
- Oracle Cloud CSM (premium).
- Oracle Cloud ERP (premium).
- Oracle Cloud HCM (premium).
- Oracle Cloud Sales (premium).
- Monday (premium).
- JDBC-ODBC bridge (premium). This connector allows you to access ODBC data sources from Etlworks.
- FHIR as database (premium).
- GraphQL (premium).
We have also updated the following existing connectors:
- Upgraded Google BigQuery JDBC driver to the latest available from Google.
- Added the ability to log in with an Azure guest user to the SQL Server connector. Read more.
- Added API Token and Basic Authentication to premium Jira and Jira Service Desk connectors.
Upgraded CDC engine
We have upgraded our CDC engine (Debezium) from 1.9 to the latest 2.1.
New functionality
- User request. It is now possible to add named connections to all source-to-destination flows. Read more.
- It is now possible to override the command which executes the Greenplum gpload utility. Read more.
- It is now possible to connect to a read-only Oracle database when streaming data using CDC. Read more.
- We have added more configuration options for CSV and JSON files created by CDC flows. Read more.
- We have improved logging for CDC connectors when capturing transaction markers is enabled. Read more.
- We have improved logging for loops by adding begin/end markers.
Important bug fixes
- SSO JWT expiration is now the same as regular JWT expiration (which is configurable by end-users). Before this fix, customers with enabled SSO were experiencing frequent logouts under certain conditions.
- We fixed an issue with the FTPS connector, which was unable to connect if the FTPS server was running behind a load balancer or proxy.
- We fixed an edge case when AWS credentials were exposed in the flow log when the Snowflake flow failed to create the Snowflake stage automatically.
UX improvements
It is now possible to quickly create Connections, Listeners, Formats, Flows, Schedules, Agents, Users, Tenants, and Webhooks from anywhere within the Etlworks UI without switching to a different window. Read more.
New functionality
We significantly improved support for PGP encryption:
- It is now possible to generate a pair of PGP keys using a designated flow type. Read more.
- All Etlworks file storage connectors now support automatic decryption of the encrypted files during ETL operations. Read more.
We improved the mapping when working with nested datasets. It now supports the case when the source is a nested document, but you only need data from the specific dimension. Read more.
Connectors
We added OAuth authentication (Sign in with Microsoft) to our Sharepoint storage and OneDrive for Business connectors.
We added a Stripe premium connector.
We upgraded the built-in SQLite database from version 3.34.0.0 to the latest version 3.40.0.0. SQLite is used as a temporary staging db. Read more about SQLite releases.
Documentation
We completely rewrote a section of the documentation related to working with nested datasets.
UX improvements
It is now possible to create Connections, Formats, and Listeners right in the Flow editor without switching to the Connections window. Read more.
New functionality
Etlworks now supports Vertica as a first-class destination and as a source. Read more.
We have added point-to-point Change Data Capture (CDC) flows for multiple destinations. After this update, you can create a CDC pipeline using a single flow instead of separate extract and load flows.
- Change Data Capture (CDC) data into Snowflake.
- Change Data Capture (CDC) data into Amazon Redshift.
- Change Data Capture (CDC) data into BigQuery.
- Change Data Capture (CDC) data into Synapse Analytics.
- Change Data Capture (CDC) data into Vertica.
- Change Data Capture (CDC) data into Greenplum.
- Change Data Capture (CDC) data into any relational databases.
- Change Data Capture (CDC) data into any relational databases using bulk load.
We have added bulk load flows for several analytical databases:
- Bulk load files in Google Cloud Storage into BigQuery
- Bulk load files into Vertica
- Bulk load files in server storage into Greenplum
All flows optimized for Snowflake now support the automatic creation of the internal stage and external stage on AWS S3 and Azure Blob. Read more.
Changes under the hood
We improved the reliability of the message queue in the multi-node environment.
This is a required update.
We have fixed the memory leak in the Amazon S3 SDK connector. We also fixed a similar memory leak in AWS-specific connectors (Kinesis, SQS, RabbitMQ, ActiveMQ) which use IAM role authentication. All instances managed by Etlworks have been upgraded. Self-hosted customers are highly advised to upgrade as soon as possible. Customers who have Integration Agents are encouraged to update the agents as well.
UX improvements
It is now possible to manage the Etlworks billing account and subscriptions from the Etlworks app. Read more.
It is now possible to access this changelog from the Etlworks app. Read more.
It is now possible to search for information in the Documentation and submit support requests from the Etlworks app. Read more.
It is now possible to resize all split panels (such as in Explorer, Connections, etc.) further to the right. It allows users to see long(er) filenames, connection names, and other objects.
We have added links to the running flows in the Suspend flow executions window.
We now display a warning message when a user is trying to create a non-optimized flow when the destination is Snowflake, Amazon Redshift, Synapse Analytics, Google BigQuery, or Greenplum. The warning message includes a link to the relevant article in the documentation.
New functionality
We have added a new flow type that can be used to create dynamic workflows which change based on user-provided parameters. Read more.
It is now possible to enter secure parameters (passwords, auth tokens, etc.) when adding parameters for running flows manually, by the scheduler, and by Integration Agent.
It is now possible to split CSV files using a user-defined delimiter or regular expression. Read more.
We have added the following new configuration options for the SMB share connector: SMB Dialect, DFS Namespace, Multi Protocol Negotiation, and Signing Required. Read more.
CDC flows now run continuously and never stop automatically unless they are stopped manually or fail. Read more. Note that the behavior of previously created CDC flows has not changed.
We have improved the algorithm CDC flows use to create transaction markers. They now use the actual start/commit/rollback events emitted by the source database. Previously, a change of the transaction ID was used as a trigger, which meant the flow had to wait for a new transaction to start before it could create the "end of previous transaction" event.
It is now possible to use flow variables as {parameters} in transformations and connections. Previously only global variables could be used to parameterize transformations and connections.
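For example, a flow variable can now drive an incremental Source Query the same way a global variable could before. The snippet below is a hypothetical illustration; `HIGH_WATERMARK` is an assumed variable name, not one defined by Etlworks.

```javascript
// Hypothetical Source Query using a flow variable as a {parameter}.
// {HIGH_WATERMARK} is replaced with the variable's value before execution.
var sourceQuery = "SELECT * FROM orders WHERE updated_at > '{HIGH_WATERMARK}'";
```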
It is now possible to change the automatically generated wildcard pattern when bulk-loading files by wildcard into Snowflake and Synapse Analytics.
User request. A new option has been added to the Synapse Analytics bulk load flow which allows creating a new database connection for loading data into each Synapse table. Read more.
We have updated the HubSpot connector which now supports new authorization scopes introduced by HubSpot in August.
Various bug fixes and performance improvements under the hood.
Documentation
We have added new tutorials for creating CDC pipelines for loading data into Snowflake, Amazon Redshift, Azure Synapse Analytics, Google BigQuery and Greenplum. Read more.
Our redesigned main website (https://etlworks.com) went live.
MySQL CDC connector now supports reading data from the compressed binlog. Read more.
It is now possible to disable flashback queries when configuring the Oracle CDC connection. This could greatly improve the performance of the snapshot in some environments. Read more.
CDC connectors can now be configured to capture NOT NULL constraints. Read more.
Legacy S3 and Azure Storage connectors have been deprecated. The existing legacy connections will continue to work indefinitely but new connections can only be created using S3 SDK and Azure Storage SDK connectors.
Bulk load flows now ignore empty data files.
Bulk load flows which are loading data from Azure Storage now support traversing all subfolders under the root folder.
The flows that extract data from nested datasets and create staging tables or files can now be configured to not create tables for dimensions converted to strings. Read more.
User request. It is now possible to add record headers when configuring Kafka and Azure Events Hubs connections. Record headers are key-value pairs that give you the ability to add some metadata about the record, without adding any extra information to the record itself.
The BigQuery connector now maps the ARRAY data type in the source database (for example Postgres) to STRING in BigQuery.
Fixed a bug that was causing a recoverable NullPointerException (NPE) when saving flow execution metrics.
Various bug fixes and performance improvements under the hood.
Single Sign On (SSO) is now available to all Etlworks Enterprise and On-Premise customers. Read more.
We have added a bulk load flow for loading CSV and Parquet files in Azure Storage into Azure Synapse Analytics. It provides the most efficient way of loading files into Synapse Analytics. Read more.
We have optimized loading data from MongoDB into relational databases and data warehouses such as Snowflake, Amazon Redshift, and Azure Synapse Analytics. It is now possible to preserve the nested nodes in the documents stored in MongoDB in the stringified JSON format. Read more.
The Flow Bulk load files into Snowflake now supports loading data in JSON, Parquet, and Avro files directly into the Variant column in Snowflake. Read more.
The Override CREATE TABLE using JavaScript option now supports ALTER TABLE as well. Read more.
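As a rough sketch of what such an override can do (the `sql` variable is an assumption for illustration, not necessarily the exact object Etlworks exposes to the script):

```javascript
// Conceptual override sketch: adjust the generated DDL before it is executed.
// 'sql' is assumed to contain the CREATE TABLE or ALTER TABLE statement.
if (sql.indexOf("CREATE TABLE") === 0) {
    sql = sql.replace(/\bTEXT\b/g, "VARCHAR(4000)");   // e.g. tighten column types
} else if (sql.indexOf("ALTER TABLE") === 0) {
    sql = sql + " /* reviewed by override script */";
}
```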
It is now possible to connect to Snowflake using External OAuth with Azure Active Directory. Read more.
The Azure Events Hubs connector now supports compression. Read more.
The Flows Executions Dashboard now displays the aggregated number of records processed by a specific flow on the selected day. It can be useful when monitoring the number of records processed by a CDC pipeline, which typically includes two independent flows (each with its own record tracking): extract and load.
We have improved the Flow which creates staging tables or flat files for each dimension of the nested dataset. It is now possible to alter the staging tables on the fly to compensate for the variable number of columns in the source. We have also added the ability to add a column to each staging table/file that contains the parent node name. Read more.
It is now possible to authenticate with SAS token and Client Secret when connecting to Azure Storage using the new Azure Storage SDK connector. Note that the legacy Azure Storage connector also supports authentication with SAS token but does not support Client Secret.
We have updated the Sybase JDBC driver to the latest version.
It is now possible to use global variables when configuring parameters for split file flows.
We have fixed the soft deletes with CDC. This functionality was broken in one of the previous builds.
User request. It is now possible to override the default Create Table SQL generated by the flow.
User request. The Flow Executions dashboard under the Account dashboard now includes stats for flows executed by the Integration Agent.
User request. It is now possible to use global and flow variables in the native SQL used to calculate the field's value in the mapping.
It is now possible to filter flows associated with the Agent by name, description, and tags.
It is now possible to configure and send email notifications from the Etlworks instance for flows executed by the Agent.
It is now possible to bulk load CSV files into Snowflake from server (local) storage. Previously it was only possible to bulk load files into Snowflake from S3, Azure Blob, or Google Cloud storage. The flow Load files in cloud storage into Snowflake was renamed to Bulk load files into Snowflake. Note that it was always possible to ETL files into Snowflake from server (local) storage.
The flow Bulk load CSV files into Snowflake now supports loading files by a wildcard pattern in COPY INTO and can handle explicit CDC updates when the CDC stream includes only updated columns.
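For reference, the wildcard load relies on Snowflake's PATTERN option in COPY INTO. The statement below only illustrates the general shape (stage, table, and pattern names are made up); it is not the exact SQL the flow generates.

```javascript
// Illustrative shape of a wildcard COPY INTO (all names are hypothetical):
var copySql =
    "COPY INTO analytics.public.orders " +
    "FROM @etl_stage/orders/ " +
    "PATTERN = '.*orders_.*[.]csv' " +
    "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)";
```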
The MySQL CDC connector now supports the useCursorFetch property. When this property is enabled, the connector uses a cursor-based result set when performing the initial snapshot. The property is disabled by default.
All CDC connectors now test the destination cloud storage connection before attempting to stream the data. If the connection is not properly configured, the CDC flow stops with an error.
Debezium has been upgraded to the latest 1.9 release.
It is now possible to add a description and flow variables to the flows scheduled to run by the Integration Agent. Read about parameterization of the flows executed by Integration Agent.
We have added a new premium Box API connector.
Snowflake, DB2, and AS400 JDBC drivers have been updated to the latest and greatest.
We introduced two major improvements for Change Data Capture (CDC) flows. The previously available mechanism for the ad-hoc snapshots using a read/write signal table in the monitored schema has been completely rewritten.
- It is now possible to add new tables to monitor and snapshot by simply modifying the list of the included tables. Read more.
- It is now possible to trigger the ad-hoc snapshot at runtime using a table in any database (including a completely different database than a database monitored by CDC flow) or a file in any of the supported file storage systems: local, remote, and cloud.
Webhooks now support custom payload templates. The templates can be used to configure integration with many third-party systems, for example, Slack.
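For example, a Slack incoming webhook expects a small JSON payload, so a template for it could look roughly like the object below. The {placeholder} syntax is shown only for illustration and is not necessarily the documented template syntax.

```javascript
// Hypothetical Slack payload template; {flow_name} and {status} stand in for
// whatever placeholders the template engine actually supports.
var payloadTemplate = {
    text: "Etlworks flow {flow_name} finished with status {status}"
};
```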
We added a ready-to-use integration with Slack. It is now possible to send notifications about various Etlworks events such as flow executed, flow failed, etc., directly to the Slack channel.
The S3 SDK connector now supports automatic pagination when reading file names by a wildcard.
The Amazon Marketplace connector now supports Sign in with Amazon and the Selling Partner API (SP-API). The MWS API has been deprecated and is no longer available when creating a new connection.
Magento connector now supports authentication with Access token.
Etlworks Integrator now supports Randomization and Anonymization for various domains, such as names, addresses, Internet (including email), IDs, and many others.
We added a new flow type: Bulk load files in S3 into Redshift. Use the Bulk Load Flow when you need to load files in S3 directly into Redshift. This Flow is extremely fast as it does not transform the data.
The Redshift driver now automatically maps columns with SMALLINT and TINYINT data types to INTEGER. It fixes an issue where Redshift was unable to load data into a SMALLINT column if the value is larger than 32767.
The CSV connector can now read gzipped files. It works in Explorer as well.
The connector for fixed-length format can now parse the header and set the length of each field in the file automatically. Read more.
It is now possible to override the default key used for encryption and decryption of the export files.
Users with the operator role can now browse data and files in Explorer.
It is now possible to override the storage type, the location, the format, whether the files should be gzipped, and the CDC Key set in the CDC connection using TO-parameters in source-to-destination transformation. Read more.
We added a new S3 connector created using the latest AWS SDK. It is now the recommended connector for S3. The old S3 connector was renamed to Legacy. We will keep the Legacy connector forever for backward compatibility reasons.
It is now possible to see and cancel actions triggered by the end-user to be executed in the Integration Agent. When the user triggers any action, such as Run Flow, Stop Flow, or Stop Agent, the action is added to the queue. The actions in the queue are executed in order on the next communication session between the Agent and the Etlworks Integrator. Read more.
We added an SMB Share connector. Among other things, it supports connecting to the network share over the SSH tunnel.
Google Sheets connector now supports configurable timeout (the default is 3 minutes) and auto-retries when reading data.
The Flow types Extract nested dataset and create staging files and Extract nested dataset and create staging tables, which are used to normalize nested datasets into a relational data model, now support message queues as a source.
It is now possible to configure the CDC connection to send records to the specific Kafka or Azure Event Hub partition. Read more.
The Legacy MySQL CDC connector now provides information about the current and previous log readers. It is especially useful when the connector is configured to automatically snapshot new tables added to the Include Tables list.
The Integration Agent is a zero-maintenance, easy-to-configure, fully autonomous ETL engine which runs as a background service behind the company’s firewall. It can be installed on Windows and Linux. The Remote Integration Agent is now fully integrated with the cloud Etlworks instance. You can monitor the Agent in real-time, schedule, run, stop and monitor flows running on-premise. Read how to install, configure, and monitor the new Integration Agent. Read about configuring flows to run in the Integration Agent.
We added a new flow type: Bulk load files into the database without transformation. Use the Bulk Load Flow when you need to load files in the local or cloud storage directly into the database which supports a bulk load. This Flow does not transform the data. Read how to ETL data into databases using bulk load.
Etlworks is now shipped with the latest stable Debezium release (1.8). We support all features introduced in 1.8 and much more. Read about creating Change Data Capture (CDC) flows in Etlworks.
Load files in cloud storage into Snowflake now supports creating all or selected columns as TEXT which mitigates issues caused by the source schema drift. Read more.
It is now possible to create a new Google Sheets spreadsheet if it does not exist. Read more.
Bulk load flows now support splitting large datasets into smaller chunks and loading chunks in parallel threads. Read more about performance optimizations when loading large datasets using bulk load flows.
It is now possible to configure the CSV format to enclose the columns in the header row in double quotes. Previously only values could be enclosed.
Added mapping to the Flow type Load files in cloud storage into Snowflake. It is now possible to globally rename and exclude columns for all tables.
Added new connectors for message queues:
Added the ability to convert nested objects to strings (stringify) when creating staging tables or files from the nested JSON and XML datasets.
Fixed an error in the Snowflake bulk load flow when the schema name starts with non-SQL characters, for example 123abc.
Added JavaScript exception handler. It is now possible to execute a program in JavaScript in case of any error.
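Conceptually, the handler is a short script that runs only when the flow fails, so it can log the failure or prepare a notification. The sketch below is illustrative; the `exception` variable is an assumption, not the documented API.

```javascript
// Conceptual exception-handler sketch ('exception' is assumed to hold the error).
var failure = {
    failed_at: new Date().toISOString(),
    error: (typeof exception !== "undefined") ? String(exception) : "unknown error"
};
// Hand 'failure' off to a logging or notification step (mechanism omitted here).
```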
The POST/PUT listeners can now be configured not to enforce strict UTF-8 encoding.
A flow can now be configured to fail if a source field in the mapping does not actually exist in the source.
Fixed an error when a field in the mapping contains trailing or leading spaces.
Added the ability to send email notifications to configurable email addresses when the webhook is triggered by the event.
Added programmatic sequence generators which can be used from JavaScript and Python code.
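The sketch below only illustrates the idea of a sequence generator (a counter that returns the next value on each call); it is not the Etlworks API.

```javascript
// Conceptual sketch, not the Etlworks API: a sequence generator hands out a new,
// monotonically increasing value on every call, e.g. for surrogate keys.
function makeSequence(start) {
    var next = start;
    return function () { return next++; };
}
var orderKey = makeSequence(1000);
// orderKey() -> 1000, then 1001, 1002, ...
```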
Added the new Flow type: bulk email reader. This flow reads the email messages (including attachments) from the inbound email connection and saves them as files into the designated folder in the server storage. Use this flow type when you need to read hundreds or thousands of emails as fast as possible from the relatively slow inbound email connection.
Added a new collision policy when importing previously exported flows: Keep all, replaces Flows and Macros. Use this policy if you are migrating flows from one environment to another and prefer to keep existing connections and formats in the destination environment.
The Server Storage connection now defaults the Directory to the Home folder (app.data).
MongoDB CDC connector now supports MongoDB change streams.
Added the ability to configure a flow so it cannot be executed manually. Enable it if the Flow is part of a nested Flow and is not meant to be executed independently.
Added a transformation status column to the flow metrics. It works best together with the option to Retry failed transformations.
Fixed an issue causing intermittent errors when sending and receiving emails to/from servers with TLS 1.1 enabled.
Added the ability to configure the type of SQL executed when merging data using Snowflake and Bulk Load flows. The default option is DELETE/INSERT. It deletes all records in the actual table that also exist in the temp table, then inserts all records from the temp table into the actual table. If this parameter is set to MERGE the flow executes native MERGE SQL.
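The two strategies boil down to the SQL shapes sketched below; table and key names are hypothetical, and the flow generates the real statements for you.

```javascript
// Illustrative shape of the two merge strategies (all names are made up):
var deleteInsert =
    "DELETE FROM orders WHERE id IN (SELECT id FROM orders_temp); " +
    "INSERT INTO orders SELECT * FROM orders_temp;";
var nativeMerge =
    "MERGE INTO orders t USING orders_temp s ON t.id = s.id " +
    "WHEN MATCHED THEN UPDATE SET t.status = s.status " +
    "WHEN NOT MATCHED THEN INSERT (id, status) VALUES (s.id, s.status);";
```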
Added binary format selector for Snowflake flows. The value of this parameter defines the encoding format for binary input or output. This option only applies when loading data into binary columns in a table. Default: HEX.
Added the ability to send a test email and set the FROM when configuring the SMTP for sending email notifications.
Added the ability to configure the number of records when sampling nested data structures. It allows the Explorer and the Mapping to more accurately set the column's data types when working with non-relational datasets, for example, JSON and XML files.
Parquet and Avro connectors now support automatic schema generation. Prior to this update, the developer would need to provide a schema in order to create a Parquet or Avro document.
Added support for the following gpload flags used by the Flows optimized for Greenplum.
Added new authentication types (including interactive authentication with Azure Active Directory) for SQL Server connector.
All bulk load flows (Snowflake, Redshift, BigQuery, Greenplum, Azure Synapse, generic Bulk load) now support extra debug logging. When the option Log each executed SQL statement is enabled, the flow will log each executed SQL statement, including Before SQL, After SQL, CREATE and ALTER TABLE, COPY INTO, MERGE, DELETE, etc.
Added Max Latency Date for the CDC metrics.
Added new webhook types:
- Webhook for the event when the flow is stopped manually or by API call.
- Webhook for the event when the flow is stopped by the scheduler because it has been running for too long.
- Webhook for the event when the flow is running too long.
- Webhook for the event when the maintenance task is not able to refresh the access token for the connection configured with Interactive Azure Active Directory authentication.
- Webhook for the event when the flow generates a warning, for example, "source table has more columns than destination table".
Added TABLE_SCHEMA and TABLE_DB flow variables for flows which load data from multiple tables matching the wildcard name. These variables can be used in Before and After SQL, as well as the Source Query.
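For example, a hypothetical Before SQL that prepares the destination schema could reference them like this (the variables are substituted before the statement runs):

```javascript
// Hypothetical Before SQL for a wildcard flow; {TABLE_DB} and {TABLE_SCHEMA}
// are replaced with the current table's database and schema before execution.
var beforeSql = "CREATE SCHEMA IF NOT EXISTS {TABLE_DB}.{TABLE_SCHEMA}";
```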
Added the ability to create all columns as nullable when creating a database table.
Added the ability to retry failed transformations on the next run.
Added the ability to use S3 authentication with IAM role or default profile.
Added password authentication for Redis.
Added the ability to execute CDC snapshots in parallel threads.
Fixed an edge case where a MySQL CDC connection configured with an SSH tunnel was not closing the SSH connection when switching from the snapshot reader to the binlog reader.
Added support for shared Google Drive. If this option is enabled (the default), the connector supports My Drives and Shared drives. When it is disabled, it only supports My Drives.
Added the ability to configure a database connection to open when needed.
The option to modify the Create table SQL is now available for all bulk load flows.
Added new flow type: Load files in cloud storage into Snowflake. This flow is the most efficient way of loading data into Snowflake when you already have CSV files in the cloud storage (Amazon S3, Google Cloud Storage, Azure Blob) and don't need to transform the data.