Overview
Vertica is an analytical database and query engine that leverages the power of cloud-based technologies but can be deployed on-premises as well. Etlworks includes several pre-built Flows optimized for Vertica.
Flows optimized for Vertica
Flow type | When to use |
ETL data into Vertica | When you need to extract data from any source, transform it, and load it into Vertica. |
Bulk load files into Vertica (you are here) | When you need to bulk-load files that already exist in local or cloud storage without applying any transformations. The Flow automatically generates the COPY command and merges data into the destination. |
Stream CDC events into Vertica | When you need to stream updates from a database that supports Change Data Capture (CDC) into Vertica in real time. |
Stream messages from a queue into Vertica | When you need to stream messages from a message queue that supports streaming into Vertica in real time. |
When to use this Flow type
Use this Flow type when:
- You want to load CSV files that already exist in cloud or local storage into Vertica without applying any transformations or transferring files over the network.
- You want to load CSV files created by the CDC Flow.
This Flow does not transform the data. Read how to ETL data into Vertica using bulk load.
Prerequisites
The Vertica database is up and running and accessible from the Etlworks instance. Read how to work with on-premise data in Etlworks.
How it works
- The Flow reads the names of all files matching the wildcard in the specified location in the cloud or file storage. It can traverse subfolders as well.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads, one per destination table but not more than the configurable threshold.
- The Flow generates and executes the COPY command to load files into the staging table.
- The Flow merges data in the staging table with data in the actual table.
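Conceptually, the generated SQL follows the pattern below. This is a simplified sketch with hypothetical table, column, and file names; the actual statements are generated by the Flow based on the configuration described in the steps below.

-- Hypothetical illustration of the pattern; not the exact SQL generated by the Flow.
-- 1. Load the file into a staging table using Vertica's COPY.
COPY public.orders_staging FROM LOCAL '/data/orders.csv'
PARSER fcsvparser(header='true') ABORT ON ERROR;

-- 2. Merge the staged rows into the actual table (default DELETE/INSERT strategy).
DELETE FROM public.orders
WHERE order_id IN (SELECT order_id FROM public.orders_staging);

INSERT INTO public.orders SELECT * FROM public.orders_staging;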
Features
The Flow supports INSERT, MERGE (UPSERT), and CDC MERGE. With CDC MERGE, the Flow updates the actual destination table by applying INSERT/UPDATE/DELETE events in the same order as they originated in the source database and were recorded in the CSV files. It does this in the most efficient way possible.
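For illustration only, the CDC MERGE semantics can be expressed in SQL roughly as shown below, assuming the staged CDC rows carry an operation marker and a sequence number that preserves the original event order. The column names op and seq, the table names, and the SQL itself are assumptions for this sketch, not the statements the Flow actually generates.

-- Conceptual sketch of applying ordered CDC events; not the Flow's actual SQL.
-- Remove the current versions of all keys present in the staged CDC batch.
DELETE FROM public.orders
WHERE order_id IN (SELECT order_id FROM public.orders_staging);

-- Re-insert only the latest non-DELETE event per key, using seq to preserve event order.
INSERT INTO public.orders (order_id, amount)
SELECT order_id, amount
FROM (
    SELECT order_id, amount, op,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY seq DESC) AS rn
    FROM public.orders_staging
) latest
WHERE rn = 1 AND op <> 'D';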
Other Features | Description |
Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source. |
Delete loaded source files | The Flow can be configured to delete successfully loaded source files. |
Load data in parallel | The Flow is able to load data into multiple destination tables in parallel. |
Process files in subfolders | The Flow is able to process files in multiple nested subfolders. |
Create all or selected columns as VARCHAR(65000) | The Flow is able to create all or selected columns as VARCHAR(65000), which mitigates issues related to the source schema drift. |
Process
Step 1. Create a new storage Connection
This Connection will be used as a source to access files stored in the local or cloud storage.
The Connection can be one of the following:
Step 2. Create a new Vertica Connection
This Connection will be used as a destination.
It is recommended to enable auto-commit for the destination Connection.
Step 3. Create a new CSV Format
Read how to create a CSV Format. Keep the default values for all parameters.
Step 4. Create a new Flow
This Flow will be used to load files from file or cloud storage into the destination database.
In Flows, click [+], type in vertica, and select Bulk load CSV files into Vertica.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- File or cloud storage Connection created in step 1.
- CSV Format created in step 3.
- A wildcard filename that matches the filenames in the cloud storage. Use the .csv extension for uncompressed files and .gz for compressed files.
- Destination database Connection created in step 2.
- The fully qualified wildcard destination table name, for example public.*.
Step 6. Set required and optional parameters
Click MAPPING.
Select the Parameters tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- Include files in subfolders: if this option is enabled, the Flow will load all files that match the wildcard filename (set in FROM) in all subfolders under the main bucket or blob.
- Exclude files and Include files: the optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in FROM:
  - The Flow populates the list of files matching the wildcard filename in FROM.
  - The Flow excludes files based on the list of exclusions set in Exclude files.
  - The Flow includes files based on the list of inclusions set in Include files.
- Calculate destination table name: this is a required parameter used to calculate (using JavaScript) the destination table name based on the source file name. The original file name, without the extension, is stored in the variable name, and the actual table name must be assigned to the variable value. For example, let's assume that the source file name is db_schema-table_cdc_stream_uuid.csv. The JavaScript code to calculate the destination table name would be value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));. IMPORTANT: the Flow automatically recognizes and transforms filenames matching the following template: *_cdc_stream_*.*. If the filenames match this template, there is no need to provide a transformation script.
- Maximum Number of Files to Process: an optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there will be hundreds of thousands or millions of files in the source folder(s), it is recommended to limit the number of files to process to a more manageable range: hundreds to thousands (10^3 - 10^5).
- Maximum Number of Parallel Loads: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel, one thread per table but not more than the threshold set in this parameter.
- Parallel: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- Create new destination connection for each table: if this parameter is enabled, a new destination Connection will be created to load data into each destination table. The Connections will be closed when the Flow finishes loading data into all tables.
- Purge Files if Success: if this parameter is enabled (default), the source files will be automatically deleted after they have been successfully loaded.
Debug
- Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including the Before/After load SQL.
Error recovery
- Continue loading data into other tables if Error: if this parameter is enabled (it is disabled by default) and there is an error when loading data into one of the tables, the Flow will continue loading data into the other tables. If configured, a notification will be sent to the webhook.
Load and Merge
- Bulk Load SQL: the Bulk Load SQL is used to load files in the cloud or file storage into the staging tables in the Vertica database. This is a required parameter. By default, we assume that the Flow bulk-loads from the local (server) storage using the COPY FROM LOCAL command:

COPY {TABLE} FROM LOCAL '{FULL_FILE_NAME}'
PARSER fcsvparser(header='true') ABORT ON ERROR;

You can modify the SQL statement that will be used to bulk-load data into the destination table (see the example after this list). The following {tokens} can be used as part of the SQL statement:
  - {TABLE}: the table to load data into.
  - {TEMP_TABLE}: the staging table name used for MERGE.
  - {FILE_TO_LOAD}: the file name to load, without path and extension.
  - PATH: the path of the file to load, without the file name, for example {app.data}/test/.
  - EXT: the file extension of the file to load, without the dot (.), for example csv.
- Action: can be INSERT, which inserts records from the file(s) into the destination table; MERGE, which merges records from the file with the records in the destination table; or CDC MERGE, which loads CDC events (INSERT/UPDATE/DELETE) in the same order as they originated in the source database. MERGE and CDC MERGE require configuring the Lookup Fields and/or enabling the option to predict the lookup fields (see Lookup Fields below).
- How to MERGE: defines how the Flow merges data in the temp or staging table with the data in the actual table. The default is DELETE/INSERT: DELETE all records in the actual table that also exist in the temp table, then INSERT all records from the temp table into the actual table. If this parameter is set to MERGE with separate DELETE, the Flow will execute native Vertica MERGE SQL for INSERTs and UPDATEs and a separate DELETE for DELETEs.
- User-defined MERGE SQL: by default, the MERGE SQL is generated automatically. Use this parameter to override the automatically generated SQL (see the example after this list). The following {tokens} are available:
  - {TABLE}: the destination table.
  - {TEMP_TABLE}: the staging or temp table.
  - {FIELDS}: the fields to INSERT/UPDATE in the destination table.
  - {INSERT_FIELDS}: the values of the fields to INSERT into the destination table.
  - {UPDATE_FIELDS}: the fields and values to UPDATE in the format field=value,field=value.
  - {UPDATE_CONDITIONS}: the condition to match records in the actual destination and temp/staging tables.
- Lookup Fields: the comma-separated list of fields that uniquely identify the record in the destination table.
  - If the option to predict the lookup fields is enabled and Lookup Fields is empty, the system will try to predict the columns that uniquely identify the record in the destination table.
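As an illustration, a customized Bulk Load SQL and a user-defined MERGE SQL might look like the statements below. This is a sketch only: the GZIP option for compressed files and the exact placement of the tokens are assumptions for this example, and the tokens are expanded by the Flow at run time.

-- Hypothetical Bulk Load SQL for gzip-compressed CSV files, loading into the staging table:
COPY {TEMP_TABLE} FROM LOCAL '{FULL_FILE_NAME}' GZIP
PARSER fcsvparser(header='true') ABORT ON ERROR;

-- Hypothetical user-defined MERGE SQL assembled from the tokens listed above:
MERGE INTO {TABLE} USING {TEMP_TABLE}
ON {UPDATE_CONDITIONS}
WHEN MATCHED THEN UPDATE SET {UPDATE_FIELDS}
WHEN NOT MATCHED THEN INSERT ({FIELDS}) VALUES ({INSERT_FIELDS});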
ELT
- Set variables for Before and After SQL: code in JavaScript to set the variables that can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable vars.
  - Example: vars.put('schema', 'abc');
  - Example of the SQL: insert into test (schema_name) values ('{schema}')
- Before load SQL: this SQL will be executed on the destination Connection before loading data (see the example after this list).
- Ignore errors when executing Before load SQL: if this parameter is enabled and there is an error when Before load SQL is executed, the error will be ignored.
- Before load SQL is a script: if this parameter is enabled, Before load SQL will be executed as a SQL script.
- After load SQL: this SQL will be executed on the destination Connection after loading data into the destination table.
- Ignore errors when executing After load SQL: if this parameter is enabled and there is an error when After load SQL is executed, the error will be ignored.
- After load SQL is a script: if this parameter is enabled, After load SQL will be executed as a SQL script.
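For illustration, a Before load SQL or After load SQL entry might look like the hypothetical statements below; the table name and the choice of statements are examples only, not something the Flow requires.

-- Hypothetical Before load SQL: clear the destination table before a full reload.
TRUNCATE TABLE public.orders;

-- Hypothetical After load SQL: refresh optimizer statistics after the load.
SELECT ANALYZE_STATISTICS('public.orders');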
Handling source schema changes
- Alter target table if the source has columns that the target table doesn't have: if this parameter is enabled (it is disabled by default) and the source has fields that the destination table doesn't have, the system will add the missing fields to the destination table (see the example at the end of this section).
- Create all actual tables with only TEXT columns: if this parameter is enabled (it is disabled by default), the Flow will create all destination tables with only VARCHAR(65000) columns. Enable this option to make the load more resilient to source schema drift. By default, the Flow creates the actual table using the first file as a pattern. It then alters the actual table by adding new columns, but it never modifies the data types of the existing columns, which can cause issues if there are multiple source files to load into the same destination table and some of these files have columns with different data types (compared to each other).
- Actual tables with all TEXT columns: a comma-separated list of regular expressions matching the table names for which the Flow should create all columns as VARCHAR(65000).

Here are the rules for the regexes:
- Regexes are case insensitive.
- If a regex includes a comma (,), it must be escaped (/,).
- Java regex syntax must be used.
- If the destination table name is configured using fully qualified (DB.SCHEMA.TABLE) or partially qualified (SCHEMA.TABLE) notation, the regex should reflect it. Example (include tables country and city): text.* -> .*country,.*city.
- TEXT columns: a comma-separated list of regular expressions matching the column names which should be created as VARCHAR(65000).
Here are the rules for the 3 parameters above:
- If Create all actual tables with only TEXT columns is enabled, all columns in all new tables and new columns in the existing tables will be created as VARCHAR(65000).
- If the table name matches a regex in Actual tables with all TEXT columns and TEXT columns is empty, all columns in that table will be created as VARCHAR(65000).
- If TEXT columns is not empty and the column name in any table matches any regex configured in this field, the column will be created as VARCHAR(65000).

- Create temp tables with only TEXT columns: if this parameter is enabled (it is disabled by default), the Flow will create temporary tables with only TEXT columns. Enable this option to make the load more resilient to source schema drift. By default, the Flow creates the temp table using the first file as a pattern. It then alters the temp table by adding new columns, but it never modifies the data types of the existing columns, which can cause issues if there are multiple source files to load into the same destination table and some of these files have columns with different data types (compared to each other). This option is automatically enabled if Create all actual tables with only TEXT columns is enabled or if the table name matches a regex configured in Actual tables with all TEXT columns.
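To illustrate the behavior described above, the DDL generated by the Flow follows the general shape below. The table and column names are hypothetical, and the exact statements are produced by the Flow.

-- Hypothetical: a destination table created with all columns as VARCHAR(65000)
-- when one of the TEXT-columns options applies.
CREATE TABLE public.orders (
    order_id VARCHAR(65000),
    amount VARCHAR(65000)
);

-- Hypothetical: a column found in a later source file is added;
-- the data types of existing columns are never modified.
ALTER TABLE public.orders ADD COLUMN discount VARCHAR(65000);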
Step 7. Optionally, add a mapping.
The mapping can be used to:
- Rename a column.
- Exclude a column.
- Change the column data type for the tables created by the Flow.
Configure mapping
Select the transformation and click MAPPING.
Add a mapping for the column CURRENT_DATE.
Renaming or excluding a column in the mapping will work for all tables that have this column. If a table does not have the column, the mapping will be ignored.
Step 8. Optionally, add more source-to-destination transformations
Click [+] and repeat steps 5 to 6 for more locations and/or file name templates.