Overview
Vertica is an analytical database and query engine that leverages the power of cloud-based technologies but can be deployed on-premises as well. Etlworks includes several pre-built bulk-load ETL Flows optimized for Vertica.
Flows optimized for Vertica
Flow type | When to use
---|---
ETL data into Vertica (you are here) | When you need to extract data from any source, transform it, and load it into Vertica.
Bulk load files into Vertica | When you need to bulk-load files that already exist in local or cloud storage without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination.
Stream CDC events into Vertica | When you need to stream updates from a database that supports Change Data Capture (CDC) into Vertica in real time.
Stream messages from a queue into Vertica | When you need to stream messages from a message queue that supports streaming into Vertica in real time.
When to use this Flow type
Unlike some other analytical databases, such as Snowflake and Redshift, Vertica supports fast INSERTs without the bulk-load COPY command. For simply inserting data into Vertica, it is recommended to use regular ETL Flows where the destination is a database.
You can use the bulk-load ETL flows where the destination is Vertica (described in this article) when you need to MERGE (UPSERT) data into Vertica.
Prerequisites
The Vertica database must be up and running and accessible from the Etlworks instance. Read how to work with on-premise data in Etlworks.
How it works
A typical bulk load ETL Flow performs the following operations:
- It extracts data from the source.
- It then creates CSV files in the staging area, which can be one of the following: local file system, Azure Blob, Google Cloud Storage, or Amazon S3.
- If needed, it compresses files using the gzip algorithm.
- It checks to see if the destination table exists, and if it does not, creates the table using metadata from the source.
- It executes the user-defined COPY SQL statement for the bulk load.
- It then optionally executes user-defined or automatically generated MERGE statements.
- In the end, it cleans up the remaining files in the staging area, if needed.
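As an illustration, the database-side part of such a run is conceptually similar to the sketch below. The table, column, and file names are hypothetical, and the actual statements are generated by the Flow:
-- Create the destination table from the source metadata if it does not exist
CREATE TABLE IF NOT EXISTS public.orders (order_id INT, amount NUMERIC(12,2), updated_at TIMESTAMP);
-- Bulk-load the staged CSV file (the user-defined COPY SQL, see Step 6)
COPY public.orders FROM LOCAL '/data/stage/orders_1.csv'
PARSER fcsvparser(header='true') ABORT ON ERROR;
-- Optionally, a MERGE statement follows when the MERGE action is configured (see Step 8)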
Flows optimized for bulk load support wildcard processing and change replication using high watermark (HWM).
Step-by-step instructions
Here are the detailed steps to ETL data into Vertica using a bulk-load Flow:
Step 1. Create all required Connections
You will need a source Connection, a Connection for the staging area (Server Storage, Amazon S3, Azure Blob, Google Cloud Storage), and a destination Vertica Connection. When creating a Vertica Connection, it is recommended to enable the Auto Commit option.
When to use the specific type of the destination Connection
Connection type | Use this Connection type if
---|---
Server storage | This is the default option and works for most cases, even when the Vertica database and Etlworks are not on the same network. Etlworks uses the COPY LOCAL command to bulk-load files created by the ETL Flow in the local (server) storage into Vertica.
Amazon S3 | The Vertica database is hosted on AWS and supports loading files from an S3 bucket.
Azure Blob | The Vertica database is hosted on Azure and supports loading files from Azure Blob storage.
Google Cloud Storage | The Vertica database is hosted on Google Cloud and supports loading files from the mapped Google Cloud Storage bucket.
When configuring a Connection for Amazon S3, Azure Storage, or Google Storage that will be used as a staging area for the bulk load Flow, it is recommended to select GZip as the value for the Archive file before copying field.
Step 2. Create a CSV Format
Read how to create a CSV Format.
For the CSV Format, it is recommended to set the Maximum number of rows in file so the Flow can split large datasets extracted from the source into chunks. Loading smaller files will be much faster, especially if you enable Use Parallel Threads when processing sources by a wildcard. Read more about performance optimizations when loading large datasets.
Step 3. Create a Flow to bulk load data into Vertica
Start creating the Flow by opening the Flows window, clicking +, and typing to Vertica into the search field. Select one of the highlighted Flow types.
Step 4. Set destination Connection
To configure the actual destination, click the Connections tab and select the Vertica Connection created in Step 1.
Step 5. Set the source (FROM) and the destination (TO)
Depending upon the Flow type, you can select one of the following sources (FROM) for the bulk load Flow:
- Database: use the database table or view name as the source (FROM) name. Wildcard names, for example, public.*, are supported.
- Well-known API: use the endpoint name or the file name as the source (FROM) name.
- Web Service: use the endpoint name as the source (FROM) name.
- File: use the source file name or a wildcard file name as the source (FROM) name.
- Queue: use the queue name as the source (FROM) name.
For all bulk load Flows, the destination Connection is going to be either Server Storage, Amazon S3, Azure Blob, Google Cloud Storage, SFTP, or FTP. Etlworks uses the destination Connection as a staging area.
In TO, select or enter the fully-qualified destination table name or a wildcard table name. If the source (FROM) is entered using a wildcard template, set the destination (TO) to a wildcard as well.
Step 6. Configure SQL for bulk load
Click MAPPING in the transformation row and select the Parameters tab. Select the Copy SQL field.
By default, we assume that the Flow bulk-loads from the local (server) storage using the COPY FROM LOCAL command.
COPY {TABLE} FROM LOCAL '{PATH}{FILE_TO_LOAD}.{EXT}'
PARSER fcsvparser(header='true') ABORT ON ERROR;
You can modify the SQL statement that will be used to bulk-load data into the destination table. The following {tokens} can be used as a part of the SQL statement:
- {TABLE}: the table to load data into.
- {TEMP_TABLE}: the staging table name used for MERGE.
- {FILE_TO_LOAD}: the file name to load, without path and extension.
- {PATH}: the path of the file to load, without the file name, for example, {app.data}/test/.
- {EXT}: the file extension of the file to load, without the dot (.), for example, csv.
Read about Vertica COPY command options.
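For example, if the staged files are gzip-compressed (see the Archive file before copying setting in Step 1), the COPY SQL could be adjusted as in the sketch below. This is based on Vertica's documented COPY compression options; adjust the statement to your actual file layout:
COPY {TABLE} FROM LOCAL '{PATH}{FILE_TO_LOAD}.{EXT}' GZIP
PARSER fcsvparser(header='true') ABORT ON ERROR;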
Step 7. Set optional parameters
Click MAPPING in the transformation row and select the Parameters tab.
7.1 COPY SQL parameters and Action
- Action: can be either COPY, which inserts records from the file(s) into the destination table, or MERGE, which merges records from the file with the records in the destination table. MERGE requires configuring Lookup Fields.
- Lookup Fields: the comma-separated list of fields that uniquely identify the record in the destination table. This field is required if Action is set to MERGE.
- Predict Lookup Fields: if this option is enabled and Lookup Fields is empty, the system will try to predict the columns that uniquely identify the record in the destination table.
- MERGE SQL: a user-defined SQL that will be used instead of the default when Action is set to MERGE. If nothing is entered in this field, the default MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL (see the example after this list):
  - {TABLE}: the table to MERGE data into.
  - {TEMP_TABLE}: the table to merge data from.
  - {KEY_FIELDS}: the fields that uniquely identify the record in both tables.
  - {FIELDS}: the fields to INSERT in the table to MERGE data into.
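For illustration, a user-defined MERGE SQL for Vertica typically follows the pattern sketched below. The aliases and columns (id, first_name, last_name) are hypothetical; the actual default MERGE SQL generated by the Flow may differ:
-- Merge staged rows ({TEMP_TABLE}) into the destination table ({TABLE})
MERGE INTO {TABLE} dst USING {TEMP_TABLE} src
ON dst.id = src.id
WHEN MATCHED THEN UPDATE SET first_name = src.first_name, last_name = src.last_name
WHEN NOT MATCHED THEN INSERT (id, first_name, last_name)
VALUES (src.id, src.first_name, src.last_name);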
7.2 Other parameters
- Purge File if Success: with purge enabled, the staging file will be automatically deleted after the COPY command is successfully executed.
- Purge File if Error: if this option is enabled (it is by default), the staging files will be automatically deleted if the COPY command was executed with an error.
- Before COPY SQL: this SQL will be executed on the destination database before running the COPY SQL (see the example after this list).
- Ignore errors when executing Before COPY SQL: if this option is enabled and there is an error when the Before COPY SQL is executed, the error will be ignored.
- After COPY SQL: this SQL will be executed on the destination database after running the COPY SQL.
- Ignore errors when executing After COPY SQL: if this option is enabled and there is an error when the After COPY SQL is executed, the error will be ignored.
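For example, for a full reload you could use a Before COPY SQL that empties the destination table before the load. The table name below is a hypothetical placeholder:
-- Before COPY SQL: empty the destination table before a full reload
TRUNCATE TABLE public.orders;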
7.3 Parameters specific for different source types
Depending on the Flow type, other Flow parameters can be added, as explained below:
- The source is a database.
- The source is a file or web service:
  - Ignore Transformations: if this parameter is enabled, the system will copy files directly into the stage before executing the COPY command. This greatly improves the performance of the load but automatically disables any transformations.
  - Delete loaded source files: if this parameter is enabled, the system will automatically delete loaded files from the source.
  - Delete source files on error: if this parameter is enabled and there is an error during the load, the system will automatically delete loaded files from the source.
- The source is a well-known API.
Step 8. Optionally Configure MERGE (UPSERT)
To merge (upsert) existing data in the destination table with new data:
- Set the Action to MERGE.
- Define the Lookup Fields: the comma-separated list of fields that uniquely identify the record in the destination table.
Alternatively, you can enable Predict Lookup Fields, which forces the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to correctly detect the unique fields.
When enabling Predict Lookup Fields (which is not always accurate) is not an option, you can specify the list of table=field pairs in the Lookup Fields. Use fully-qualified table names and ; as a separator between table=field pairs.
Example:
test1.inventory=inventory_id,database_name;
test1.payment=payment_id,database_name;
test1.rental=rental_id,database_name;
Step 9. Configure how to handle source and destination schema changes
It is quite typical for the source and the destination to have a different number of columns. It is also entirely possible that the order of columns in the source is different than the order of columns in the destination. In either of these cases, by default, the Flow will fail, since the COPY command typically cannot load files that have more or fewer columns than the target table, or whose column order is different.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
1. Reorder columns to load to match the order of columns in the target table, and update column types to match the target: if this option is enabled (it is disabled by default), the system will reorder columns in the data file to match the order of columns in the destination table. When this option is enabled, the system also updates the column data types to match the existing columns in the destination table.
2. Ignore fields in the source that do not exist in the target table: if this option is enabled (it is disabled by default) and the source has more fields than the destination table, the system will automatically exclude the extra fields.
3. Alter the target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has fields that the destination table does not have, the system will add the extra fields to the target table (see the sketch after this list).
4. Recreate the target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the destination table, the system will automatically drop and recreate the target table.
5. Insert null into the fields in the target table that do not exist in the source: if this option is enabled (it is disabled by default) and the destination table has fields that do not exist in the source, the system will automatically insert NULL values into these fields in the target.
6. Create table SQL: by default, the system automatically generates the SQL for creating a new table if it does not exist. You can override the default SQL by providing a template or JavaScript code. Read more about how to override the create table SQL.
Options 2, 3, and 4 are mutually exclusive.
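For illustration, when option 3 is enabled and the source contains a column that the target table does not have, the adjustment is conceptually equivalent to an ALTER TABLE statement like the one below. The column name and type are hypothetical; the actual statement is generated by the Flow:
-- Conceptual equivalent of adding a new source column to the target table
ALTER TABLE public.orders ADD COLUMN discount_code VARCHAR(64);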
9.1 Configuring notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure Flow to send a notification when either source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
The only difference is that you don't need to configure the exception handler. If the system detects that the source and the destination have a different number of columns, it will add an exception to the log, which can then be retrieved and sent in the email body as explained above.
Step 10. Optionally configure Mapping
If necessary, you can create a Mapping between the source and destination fields.
Set up incremental Change Replication using high watermark
As in any other Flow type, it is possible to configure a change replication using a high watermark.
When change replication is enabled, only the changed records will be loaded into the destination table.
Basically, all you need to do is set the High Watermark Field and Enable Change Replication for the transformation. Additionally, for better readability, you can set the calculated High Watermark Field Value.
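Conceptually, with change replication enabled, the extract is limited to rows whose high watermark value is greater than the value recorded during the previous run. As a hypothetical illustration (the table, column, and literal below are made up):
-- Only rows changed since the last recorded high watermark are extracted
SELECT * FROM public.orders
WHERE updated_at > '2024-05-01 00:00:00';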
Loading multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations), follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully-qualified source database object name with wildcard characters (* and ?), for example, public.*.
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude Objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include Objects field.
TO
Set the TO to a wildcard destination name in the schema.* format, for example, public.*.
Read how to programmatically change the destination (TO) name.
EXCLUDE
Optionally configure the list of database objects to exclude.
To exclude all views, enter all views in the Exclude Objects field.
To exclude all tables, enter all tables in the Exclude Objects field.
INCLUDE
Optionally configure the list of database objects to include.
SOURCE QUERY
Optionally configure the Source query. You can use the token {TABLE} in the Source query.
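For example, a Source query that limits the extracted rows could reference the token like this (the WHERE clause is a hypothetical illustration; {TABLE} is replaced with each matched object name):
SELECT * FROM {TABLE} WHERE status = 'active';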
BEFORE/AFTER SQL
If you want to execute any code before and/or after the COPY command, use the token {TABLE} in the Before COPY SQL and After COPY SQL.
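For example, an After COPY SQL that refreshes optimizer statistics for each loaded table could reference the token as follows (ANALYZE_STATISTICS is a standard Vertica function; whether you need it depends on your environment):
-- Executed once for every destination table matched by the wildcard
SELECT ANALYZE_STATISTICS('{TABLE}');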
MERGE
If you are configuring the MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
Alternatively, if you prefer not to rely on Predict Lookup Fields (which is not always accurate), you can specify the list of table=field pairs in the Lookup Fields. Use fully-qualified table names and ; as a separator between table=field pairs.
Example:
test1.inventory=inventory_id,database_name;
test1.payment=payment_id,database_name;
test1.rental=rental_id,database_name;
HWM CHANGE REPLICATION
If you are setting up high watermark change replication with a calculated HWM field, use the token {TABLE} in the High Watermark Field Value.
Load large datasets
Flows optimized for the bulk load can move extremely large (billions of rows) datasets from the source to the destination. Below are some of the techniques recommended for achieving better performance.
Configure flow to split CSV files in chunks
If the Flow is configured to load CSV files, set the Maximum number of rows in file so the Flow can split large datasets extracted from the source into chunks. Loading smaller files will be much faster. For most cases, we recommend setting this parameter to 100000.
Enable parallel load
When the destination CSV format is configured to split the large dataset in chunks (see above), the source-to-destination transformation in the bulk load Flow can be configured to load chunked files in parallel threads.
To enable the parallel load, enable the parameter Use Parallel Threads when processing sources by a wildcard under MAPPING > Parameters > Common Parameters.
Enable high watermark change replication (HWM)
When high watermark change replication is enabled, only the changed records will be extracted and loaded into the destination table.