Overview
Amazon Redshift is a columnar relational database, so you can use the same techniques you would normally use to work with relational databases in Etlworks Integrator. It is, however, important to understand that inserting data into Redshift row by row can be painfully slow.
It is recommended that you use a Redshift-optimized flow to load data into Redshift.
Using Redshift-optimized flows, you can extract data from any of the supported sources and load it directly into Redshift.
A typical Redshift flow performs the following operations:
- Extracts data from the source.
- Creates CSV files.
- Compresses files using the gzip algorithm.
- Copies the files into an Amazon S3 bucket.
- Checks whether the destination Redshift table exists and, if it does not, creates the table using metadata from the source.
- Dynamically generates and executes the Redshift COPY command.
- Cleans up the remaining files, if needed.
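For reference, a dynamically generated COPY command typically looks similar to the sketch below. The table, bucket, file, and IAM role names are placeholders, not values produced by Integrator, and the actual command generated by the flow may include different options.

-- Illustrative only: all names below are placeholders
COPY public.orders
FROM 's3://my-staging-bucket/orders.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV
GZIP
NULL AS '\N';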
Prerequisites
- The Redshift cluster is up and running and accessible from the Internet.
- The Redshift user has the INSERT privilege for the table(s).
- The Amazon S3 bucket is created and Redshift is able to access the bucket.
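If you need to grant the required privilege manually, a minimal sketch is shown below; the schema, table, and user names are placeholders. The second GRANT is only an assumption for the case where the flow is expected to create the target table automatically.

-- Allow the ETL user to insert into an existing target table (names are placeholders)
GRANT INSERT ON public.orders TO etl_user;
-- Only needed if the flow should be able to create the table automatically
GRANT CREATE ON SCHEMA public TO etl_user;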
Step-by-step instruction
Step 1. Create all required connections.
You will need a source connection, an Amazon S3 connection used as a stage for the file to load, and a Redshift connection.
When configuring a connection for Amazon S3 that will be used as a stage for the Redshift flows, it is recommended that you select GZip as the value for the Archive file before copying field.
Step 2. Create a data exchange format
Redshift can load data from CSV, JSON, Avro, and other data exchange formats, but Etlworks only supports loading from CSV, so you will need to create a CSV format.
When configuring the CSV format, it is recommended to set the Value for null field to \N so the Redshift COPY command can differentiate between an empty string and a NULL value.
If you are using a CSV format to load large datasets into Redshift, consider configuring the format to split the document into smaller files: Redshift can load files in parallel, and transferring smaller files over the network can be faster.
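To illustrate why \N matters, here is a hedged example of a staged CSV row and the matching COPY clause; the column names and values are made up for this illustration.

id,name,comment
1,"",\N

With NULL AS '\N' in the generated COPY command, the third value is loaded as NULL, while the second one remains an empty string.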
Step 3. Create a flow to load data in Redshift
Start creating Redshift flows by opening the Flows window, clicking the + button, and typing redshift into the search field.
Continue by selecting the flow type, adding source-to-destination transformations, and entering the transformation parameters.
Step 4. Set Redshift Connection
For all Redshift flows, the final destination is redshift.
To configure the final destination, click the Connections tab and select the available Redshift connection.
Step 5. Set the source and the destination
Depending upon the flow type, you can select one of the following sources (FROM) for the Redshift flow:
- Database - use the table name as the source (FROM) name
- CDC - use the fully qualified table name as the source (FROM) name
- API - use any appropriate string as the source (FROM) name
- Web Service - use any appropriate string as the source (FROM) name
- File - use the source file name or a wildcard filename as the source (FROM) name
- Queue - use the queue topic name as the source (FROM) name
For all Redshift flows, the destination connection is going to be an Amazon S3 connection. Integrator uses the destination connection as a Redshift stage.
Select or enter the fully-qualified Redshift table name as a destination (TO).
Step 6. Set the data exchange format configured in Step 2
Step 7. Set the optional parameters
Click the MAPPING button in the transformation row and select the Parameters tab.
Optional parameters and transformations
- Redshift Table Name - the name of the Redshift table to load data into. This field is optional and overrides the Redshift table name set at the transformation level.
- Column List - you can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
- Truncate Columns - if this option is enabled, the string literals will be truncated during the load to the maximum allowed size for the designated column.
- Load empty chars as null - Indicates that Amazon Redshift should load empty CHAR and VARCHAR fields as NULL. Empty fields for other data types, such as INT, are always loaded with NULL. Empty fields occur when data contains two delimiters in succession with no characters between the delimiters.
- Load blank chars as null - Loads blank fields, which consist of only white space characters, as NULL. This option applies only to CHAR and VARCHAR columns. Blank fields for other data types, such as INT, are always loaded with NULL. For example, a string that contains three space characters in succession (and no other characters) is loaded as a NULL. The default behavior, without this option, is to load the space characters as is.
- Accept any date - Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error. This parameter applies only to TIMESTAMP and DATE columns.
- Accept invalid UTF8 chars - Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters. When ACCEPTINVCHARS is specified, COPY replaces each invalid UTF-8 character with a string of equal length consisting of the character specified by replacement_char. The default replacement_char is a question mark ( ? ).
- Escape - When this parameter is specified, the backslash character in input data is treated as an escape character. The character that immediately follows the backslash character is loaded into the table as part of the current column value, even if it is a character that normally serves a special purpose. For example, you can use this parameter to escape the delimiter character, a quotation mark, an embedded newline character, or the escape character itself when any of these characters is a legitimate part of a column value.
- Remove quotes - Removes surrounding quotation marks from strings in the incoming data. All characters within the quotation marks, including delimiters, are retained.
- Explicit IDs - Use EXPLICIT_IDS with tables that have IDENTITY columns if you want to override the autogenerated values with explicit values from the source data files for the tables. If the command includes a column list, that list must include the IDENTITY columns to use this parameter. The data format for EXPLICIT_IDS values must match the IDENTITY format specified by the CREATE TABLE definition.
- File Format - the CSV format definition. Read about Redshift file formats. The format definition is optional. If not provided - the system will automatically create an ad-hoc definition based on the destination format configured for the transformation.
- IAM Role - the optional AWS IAM role used to access the S3 bucket. If the role is not provided the flow will use the Access Key and Secret configured for the S3 connection used in this transformation.
- Max # of Errors to Ignore - The default value is 0 and the limit is 100000. Use this parameter to allow loads to continue when certain rows fail to load into the table because of formatting errors or other inconsistencies in the data. Set this value to 0 or 1 if you want the load to fail as soon as the first error occurs.
- Refresh of Optimizer Statistics - Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command. By default, if the STATUPDATE parameter is not used, statistics are updated automatically if the table is initially empty. With STATUPDATE ON, statistics are updated automatically regardless of whether the table is initially empty. With STATUPDATE OFF, statistics are never updated.
- No Load - Checks the validity of the data file without actually loading the data. Use the NOLOAD parameter to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD parameter is much faster than loading the data because it only parses the files.
- Action - can be either COPY INTO, which inserts records from the file(s) into the Redshift table, or MERGE, which merges records from the file with the records in the Redshift table. MERGE requires configuring Lookup Fields.
- Lookup Fields - the comma-separated list of fields that uniquely identify the record in the Redshift table. This field is required if the action is set to MERGE.
- Copy SQL - a user-defined COPY SQL. By default, Integrator creates the COPY SQL automatically, based on the input and output parameters; you can override it by using this field. Read more about the Redshift COPY command. A sketch of a COPY command using several of the parameters above is shown after this list.
- MERGE SQL - a user-defined SQL that will be used instead of the default when the action is set to MERGE. If nothing is entered in this field, the default MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
- {TABLE} - the table to MERGE data into,
- {TEMP_TABLE} - the table to merge data from,
- {KEY_FIELDS} - the fields uniquely identifying the record in both tables,
- {FIELDS} - the fields to INSERT/UPDATE in the table to MERGE data into.
- CDC MERGE SQL - a user-defined SQL that will be used instead of the default when the action is set to CDC MERGE. If nothing is entered in this field, the default CDC MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
- {TABLE} - the table to MERGE data into,
- {TEMP_TABLE} - the table to merge data from,
- {KEY_FIELDS} - the fields uniquely identifying the record in both tables,
- {FIELDS} - the fields to INSERT,
- {UPDATE_FIELDS} - the fields and values to UPDATE, in the format field=value,field=value,
- {UPDATE_CONDITIONS} - the WHERE clause used to update existing records.
- Purge File if Success - if this option is enabled (by default), and the user-defined SQL is not used, Integrator will run the COPY INTO command with the purge option enabled. With purge enabled, the staging file will be automatically deleted after the COPY INTO command is successfully executed.
- Purge File(s) if Error - if this option is enabled (by default) the staging files will be automatically deleted if a COPY INTO command was executed with an error.
- Name(s) of the File(s) to Purge - if "Purge File(s) if Error" is enabled, it is possible to configure which files you want the system to delete if an error occurs while executing the COPY command. Wildcard file names are allowed. If not specified, the system will use the same wildcard file name as it used for the COPY command.
- Before COPY SQL - this SQL will be executed on the Redshift connection before running COPY SQL.
- Ignore errors when executing Before COPY SQL - if this option is enabled, and there is an error when "Before COPY SQL" is executed - the error will be ignored.
- After COPY SQL - this SQL will be executed on the Redshift connection after running COPY SQL.
- Ignore errors when executing After COPY SQL - if this option is enabled, and there is an error when "After COPY SQL" is executed - the error will be ignored.
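To illustrate how several of the parameters above map to Redshift COPY options, here is a hedged sketch of a COPY command. The table, column, bucket, and IAM role names are placeholders, and the command Integrator actually generates for your transformation may differ.

-- Illustrative only: all names and values below are placeholders
COPY public.orders (order_id, customer_id, amount, created_at)  -- Column List
FROM 's3://my-staging-bucket/orders.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'    -- IAM Role
FORMAT AS CSV
GZIP
NULL AS '\N'
TRUNCATECOLUMNS      -- Truncate Columns
EMPTYASNULL          -- Load empty chars as null
BLANKSASNULL         -- Load blank chars as null
ACCEPTANYDATE        -- Accept any date
ACCEPTINVCHARS '?'   -- Accept invalid UTF8 chars
MAXERROR 10          -- Max # of Errors to Ignore
STATUPDATE ON;       -- Refresh of Optimizer Statistics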
Parameters specific for different source types
Depending on the flow type, other flow parameters can be added, as explained below:
- The source is a database
- The source is a file or web service
- - if this parameter is enabled the system will copy files directly into the Redshift stage before executing the COPY command. This greatly improves the performance of the load but automatically disables any transformations.
- - if this parameter is enabled the system will automatically delete loaded files from the source.
- - if this parameter is enabled and there is an error during the load the system will automatically delete loaded files from the source.
- The source is a well-known API
Step 8. Optionally configure MERGE (UPSERT)
To merge (upsert) existing data in the Redshift table with new data:
- Set the Action to MERGE.
- Define the Lookup Fields - the comma-separated list of fields which uniquely identify the record in the Redshift table.
Alternatively, you can enable the Predict Lookup Fields option, which forces the flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to correctly detect the unique fields.
If enabling Predict Lookup Fields (which is not always accurate) is not an option, you can specify a list of table=fields pairs in the Lookup Fields. Use fully-qualified table names and ';' as a separator between the pairs.
Example:
test1.inventory=inventory_id,database_name;
test1.payment=payment_id,database_name;
test1.rental=rental_id,database_name;
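For reference, a hedged sketch of the kind of SQL an upsert into Redshift typically boils down to is shown below. Here public.orders stands in for {TABLE}, public.orders_stage for {TEMP_TABLE}, and order_id for the Lookup Field; the default MERGE SQL generated by the flow may differ.

-- Classic Redshift upsert pattern (all names are placeholders)
BEGIN;
-- Remove rows that will be replaced by the staged data
DELETE FROM public.orders
USING public.orders_stage
WHERE public.orders.order_id = public.orders_stage.order_id;
-- Insert the staged data
INSERT INTO public.orders
SELECT * FROM public.orders_stage;
COMMIT;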
Step 9. Configure how to handle source and destination schema changes
It is quite typical for the source (for example, a table in an OLTP database) and the destination (a Redshift table) to have a different number of columns. It is also entirely possible that the order of columns in the source differs from the order of columns in the destination. In either of these cases, by default, the flow will fail, since the Redshift COPY command cannot load files that have more or fewer columns than the target table, or whose column order is different.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
- Reorder columns to load to match order of columns in the target table. Also update column types to match the target - if this option is enabled (it is disabled by default), the system will reorder columns in the data file to match the order of columns in the target Redshift table. When this option is enabled, the system also updates the column data types to match the existing columns in the target table. Enable it to fix the "Column type 'VALUE' is not recognized" error. This option is ignored if the target table does not exist.
- Ignore fields in the source that do not exist in the target table - if this option is enabled (it is disabled by default) and the source has more fields than the target (Redshift) table, the system will automatically exclude the extra fields. The Ignore, Alter, and Recreate options are mutually exclusive.
- Alter target table if the source has columns that the target table doesn't have - if this option is enabled (it is disabled by default) and the source has fields that the target (Redshift) table doesn't have, the system will add the extra fields to the target table. The Ignore, Alter, and Recreate options are mutually exclusive.
- Recreate target table if the source has columns that the target table doesn't have - if this option is enabled (it is disabled by default) and the source has fields that the target (Redshift) table doesn't have, the system will automatically drop and recreate the target table. The Ignore, Alter, and Recreate options are mutually exclusive.
- Insert null into the fields in the target table that do not exist in the source - if this option is enabled (it is disabled by default) and the target (Redshift) table has fields that the source doesn't have, the system will automatically insert NULL values into these fields in the target.
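As an illustration of what the Alter target table option amounts to, the statement below adds a column that appeared in the source but is missing in the target. The table and column names are placeholders, and the DDL actually issued by the system may differ.

-- Bring the target table in line with a new source column (names are placeholders)
ALTER TABLE public.orders ADD COLUMN coupon_code VARCHAR(256);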
Configuring notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure the flow to send a notification when either the source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
Step 10. Optionally configure Mapping
If necessary, you can create a mapping between the source and destination (Redshift) fields.
You create the mapping between the source and the destination just as you would for any other flow type.
Mapping is not required, but keep in mind that if a source field name is not supported by Redshift, Redshift will return an error and the data will not be loaded into the database. For example, if you are loading data from Google Analytics, the output (source) is going to include fields with the ga: prefix (ga:user, ga:browser, etc.). Unfortunately, Redshift does not support fields containing a colon (:), so the data will be rejected. If that happens, you can use mapping to rename the destination fields, for example ga:user to ga_user and ga:browser to ga_browser.
Set up change replication using high watermark
As in any other flow type, it is possible to configure change replication using a high watermark.
When change replication is enabled, only the changed records will be loaded into Redshift.
Basically, all you need to do is set the high watermark field and enable change replication for the transformation. Additionally, for better reliability, you can set the calculated high watermark field value.
Set up change replication using change data capture (CDC)
Change data capture (CDC) is an approach to data integration that is based on the identification, capture, and delivery of the changes made to the source database and stored in the database redo log (also called transaction log).
Etlworks supports replicating data using CDC from MySQL, SQL Server, Oracle, PostgreSQL, and MongoDB.
Read about configuring CDC for the source databases.
Once CDC is configured for the source database, you can create a CDC pipeline where the source is one of these databases and the destination is Amazon Redshift.
When creating a point-to-point CDC pipeline for extracting data from a CDC-enabled database and loading it into Redshift, you have two options:
- Stream CDC events, create CSV files + load data from CSV files into Redshift.
- Stream CDC events into a message queue + load data from a message queue into Redshift.
Loading multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations) follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully qualified source database object name with wildcard characters (* and ?), for example public.*
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include objects field.
TO
Set the TO to redshift_schema.*, for example public.*.
Read how to programmatically change the destination (TO) name.
EXCLUDE
Optionally configure the list of database objects to exclude.
To exclude all views, enter all views in the Exclude objects field.
To exclude all tables, enter all tables in the Exclude objects field.
INCLUDE
Optionally configure the list of database objects to include.
SOURCE QUERY
Optionally configure the Source query. You can use the token {TABLE} in the source query.
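A hedged example of a Source query using the {TABLE} token is shown below; the filter column is a placeholder that only illustrates the idea of limiting the extracted rows.

-- {TABLE} is replaced with the actual table name at run time; status is a placeholder column
SELECT * FROM {TABLE} WHERE status = 'ACTIVE'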
BEFORE/AFTER SQL
If you want to execute any code before and/or after the COPY command, use the token {TABLE} in the Before COPY SQL and After COPY SQL.
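For example (illustrative only), you could refresh table statistics after each load by using the {TABLE} token in the After COPY SQL:

-- After COPY SQL: update optimizer statistics for the just-loaded table
ANALYZE {TABLE};

A Before COPY SQL such as TRUNCATE {TABLE}; is another common pattern when each run is a full reload, but whether that fits depends on your pipeline.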
MERGE
If you are configuring the MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
As an alternative to enabling the Predict Lookup Fields option (which is not always accurate), you can specify a list of table=fields pairs in the Lookup Fields. Use fully-qualified table names and ';' as a separator between the pairs.
Example:
test1.inventory=inventory_id,database_name;
test1.payment=payment_id,database_name;
test1.rental=rental_id,database_name;
HWM CHANGE REPLICATION
If you are setting up high watermark change replication with a calculated HWM field, use the token {TABLE} in the High Watermark Field Value.
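As a purely hypothetical illustration (the exact form expected by your setup may differ), a calculated value that references the {TABLE} token could look like this:

-- Hypothetical calculated high watermark value using the {TABLE} token
SELECT MAX(updated_at) FROM {TABLE}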
Create all string fields as original length multiplied by N
In Redshift, there is no NVARCHAR data type, and VARCHAR(1) is 1 byte, not 1 character, so Unicode strings can get truncated when the flow automatically creates the Redshift table from the source.
You can configure the destination Redshift connection to multiply the original column length. Even with the multiplier, the maximum column length will not exceed 65535.
- Varchar Field Size Multiplier - the value of this field, if greater than 1, will be used to multiply the default size of VARCHAR-like columns (except CHAR) when automatically creating the destination table.
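A hedged illustration of the effect: with a multiplier of 4, a source column defined as VARCHAR(255) would be created in Redshift as VARCHAR(1020), and a very wide source column would be capped at VARCHAR(65535). The table and column names below are placeholders.

-- Source column: name VARCHAR(255); Varchar Field Size Multiplier = 4
CREATE TABLE public.customers (
    name VARCHAR(1020)  -- 255 * 4, capped at 65535
);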
Common issues when loading data in Redshift
Read how to troubleshoot and fix common issues when loading data in Amazon Redshift.