Use this flow to extract data from any supported source, transform it, and bulk-load it into Vertica. Etlworks stages CSV files (locally or in S3 / Azure Blob / GCS) and runs the Vertica COPY command for you, optionally followed by a MERGE.
Note: When to use bulk-load vs. regular ETL. Unlike Snowflake or Redshift, Vertica supports fast INSERT without COPY. For straight inserts, the regular database ETL flow is usually a better fit. Reach for this bulk-load flow specifically when you need MERGE (UPSERT) into Vertica or you're moving very large datasets.
Which Vertica flow should I use?
| Flow | Use when |
|---|---|
|
Any to Vertica (this article) Database / File / Queue / Web service / Well-known API → Vertica |
You need MERGE into Vertica, or to move large datasets with optional transformation. |
| Bulk load files into Vertica | The files already exist in local or cloud storage. No transformation needed. Auto-generates COPY; supports MERGE. |
| Stream CDC events into Vertica | You need real-time replication from a CDC-enabled source database. |
| Streaming with message queues | You need real-time ingestion from a message queue that supports streaming. |
How does this flow work?
- Extracts data from the source.
- Writes CSV files to the staging area: local server storage, Amazon S3, Azure Blob, or Google Cloud Storage.
- Optionally compresses the files with gzip.
- Creates the destination Vertica table from the source metadata if it doesn't exist.
- Runs the user-defined or auto-generated COPY SQL for the bulk load.
- If MERGE is configured, runs the auto-generated or user-defined MERGE statement.
- Cleans up staged files.
Wildcard processing and change replication via high watermark are both supported.
What do I need before I start?
Vertica is up and running and reachable from your Etlworks instance. For on-premise Vertica, see Working with on-premise data.
Step-by-step setup
Step 1. Create connections
You need three connections:
- A source connection.
- A stage connection — choose based on where Vertica is hosted (see table below).
- A Vertica connection. Enable Auto Commit.
| Stage connection type | Use when |
|---|---|
| Server storage | Default. Works for most cases, including when Vertica and Etlworks aren't on the same network. Etlworks runs COPY LOCAL against staged files on the Etlworks server. |
| Amazon S3 | Vertica is hosted on AWS and can read from the S3 bucket. |
| Azure Blob | Vertica is hosted on Azure and can read from the Azure Blob container. |
| Google Cloud Storage | Vertica is hosted on Google Cloud and can read from the GCS bucket. |
For S3 / Azure / GCS stages, set Archive file before copying to GZip.
Step 2. Create a CSV format
Create a CSV format. For large datasets, set Maximum number of rows in file so the flow splits source data into chunks. Smaller files load much faster, especially with parallel threads enabled.
Step 3. Create the flow
Open Flows, click +, type to Vertica in the gallery search, and pick the flow type that matches your source.
Step 4. Set the Vertica connection on the flow
Open the Connections tab and pick the Vertica connection from Step 1.
Step 5. Set source (FROM) and destination (TO)
FROM (varies by flow type):
- Database: table or view name. Wildcards supported, e.g. public.*.
- Well-known API: endpoint or file name.
- Web service: endpoint name.
- File: file name or wildcard pattern.
- Queue: queue name.
TO: the fully qualified Vertica table name. If FROM uses a wildcard, use a matching wildcard in TO.
The stage connection (from Step 1) is the destination connection on the transformation. Etlworks treats it as the Vertica staging area.
Step 6. Configure the COPY SQL
Click MAPPING on the transformation row, go to the Parameters tab, and edit the Copy SQL field.
Default for the local-storage stage is:
COPY {TABLE} FROM LOCAL '{PATH}{FILE_TO_LOAD}.{EXT}'
PARSER fcsvparser(header='true') ABORT ON ERROR;
Override with your own SQL if needed. Available tokens:
- {TABLE} — destination table.
- {TEMP_TABLE} — staging table used for MERGE.
- {FILE_TO_LOAD} — file name without path or extension.
- {PATH} — path to the file without the filename (e.g. {app.data}/test/).
- {EXT} — file extension without the leading dot (e.g. csv).
See the Vertica COPY command reference for all options.
Step 7. Set parameters (optional)
COPY SQL parameters and Action
| Parameter | What it does |
|---|---|
| Action | COPY inserts rows from the file into the destination; MERGE upserts on lookup fields. |
| Lookup Fields | Comma-separated list of columns that uniquely identify a record. Required for MERGE. |
| Predict Lookup Fields | Auto-detect the unique-key columns when Lookup Fields is empty. Less reliable. |
| MERGE SQL | User-defined MERGE SQL. Tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}. |
File cleanup and ELT SQL
| Parameter | What it does |
|---|---|
| Purge File if Success | Deletes the staged file after a successful COPY. |
| Purge File if Error | On by default. Deletes staged files on COPY failure. |
| Before COPY SQL / Ignore errors | SQL to run on the Vertica connection before COPY, with optional error suppression. |
| After COPY SQL / Ignore errors | Same, but after COPY. |
Source-specific parameters
- Database source: see Extract, Transform, and Load (ETL) data.
-
File or web-service source:
- Ignore Transformations: copies files directly to the stage without applying transformations. Faster; transformation steps are lost.
- Delete loaded source files: deletes the source file after a successful load.
- Delete source files on error: deletes the source file if the load fails.
- Well-known API source: see the well-known API options.
Step 8. Configure MERGE (UPSERT) (optional)
- Set Action to MERGE.
- Provide Lookup Fields, or enable Predict Lookup Fields.
For wildcard sources targeting multiple tables, use semicolon-separated per-table lookup fields:
test1.inventory=inventory_id,database_name; test1.payment=payment_id,database_name; test1.rental=rental_id,database_name;
Step 9. Handle source / destination schema differences
| Option | What it does |
|---|---|
| Reorder columns to match target (and update types) | Reorders columns in the data file to match the target and updates types to match the existing columns. |
| Ignore Fields in Source Not in Target | Drops source columns the target doesn't have. |
| Alter Target Table If Source Has Extra Columns | Adds the extra columns to the Vertica table. |
| Recreate Target Table If Source Has Extra Columns | Drops and recreates the target table. Destructive. |
| Insert NULL Into Target Fields Not in Source | Inserts NULL for any target column the source doesn't have. |
| Create Table SQL | Custom SQL or JavaScript template for creating the target table. See overriding the create-table SQL. |
Notifications on schema mismatch
If columns don't match, the flow logs an exception. See sending notifications about failed transformations.
Step 10. Configure mapping (optional)
If you need explicit field-to-field mapping, set it up in Mapping. Auto-mapping by field name works for most flows.
Incremental change replication (high watermark)
Change replication using HWM is supported. Set the High Watermark Field, enable Change Replication, and optionally configure a calculated HWM value.
Load multiple tables with a wildcard
FROM
Create a single source-to-destination transformation and set FROM to a wildcard pattern, e.g. public.*.
TO
Set TO to a Vertica schema with a wildcard, e.g. schema.*.
Include / exclude objects
- Exclude Objects: comma-separated list of objects to skip.
- Include Objects: comma-separated list of objects to keep.
Source query, before/after SQL
Use the {TABLE} token in Source Query, Before COPY SQL, and After COPY SQL.
MERGE with wildcards
Leave Lookup Fields empty and enable Predict Lookup Fields, or specify per-table lookup fields as semicolon-separated pairs.
HWM with wildcards
Use the {TABLE} token in the High Watermark Field Value.
Load large datasets
For billions of rows, three techniques help:
Split CSV files into chunks
Set Maximum number of rows in file on the CSV format. 100,000 is a good starting value. Smaller files load faster.
Enable parallel load
When CSV chunking is enabled, the transformation can load chunked files in parallel. Enable Use Parallel Threads when processing sources by a wildcard under MAPPING → Parameters → Common Parameters.
Enable HWM change replication
With HWM change replication enabled, only changed records are extracted and loaded.
Troubleshooting
See Common issues when loading data into cloud data warehouses for common errors and fixes.