Use this flow to extract data from any supported source, transform it, and load it into Amazon Redshift at high performance. Etlworks generates and runs the Redshift COPY command for you, optionally followed by a MERGE.
Which Redshift flow should I use?
Etlworks ships several Redshift-optimized flow types. Pick the one that matches your source.
| Flow | Use when |
|---|---|
|
Any to Redshift (this article) Database / File / Queue / Web service / Well-known API → Redshift |
You need to extract from any source, optionally transform, and load into Redshift. |
| Bulk load files in S3 into Redshift | The files already exist in S3. No transformation needed. Auto-generates COPY; supports MERGE. |
| Stream CDC events into Redshift | You need real-time replication from a CDC-enabled source database. |
| Streaming with message queues | You need real-time ingestion from a message queue that supports streaming. |
| Load files in Redshift using user-defined COPY command | You have a user-defined Redshift COPY command and want Etlworks to run it. Does not support automatic MERGE. |
Redshift is a column-based relational database. Row-by-row inserts are very slow. The flows above use COPY from an S3 stage, which is the supported high-performance load path.
How does this flow work?
An Any-to-Redshift flow executes these steps for each transformation:
- Extracts data from the source.
- Writes the data to CSV files. (Redshift flows in Etlworks support CSV only as the staging format.)
- Compresses the files with gzip.
- Uploads the files into an S3 bucket that serves as the Redshift stage.
- Creates the destination Redshift table from the source metadata if it doesn't exist.
- Generates and runs the Redshift COPY command.
- If MERGE is configured, merges the staged data with the existing target table.
- Cleans up staged files.
What do I need before I start?
- Redshift is running and reachable from your Etlworks instance.
- The Redshift user has INSERT privilege on the target table(s).
- An S3 bucket exists and Redshift can read from it (via IAM role or access keys).
Step-by-step setup
Step 1. Create connections
You need three connections:
- A source connection (database, file storage, web service, queue, or well-known API).
- An Amazon S3 connection for the stage. Set Archive file before copying to GZip.
- A Redshift connection.
Step 2. Create a CSV format
Redshift flows in Etlworks load from CSV only. Create a CSV format.
Recommended CSV settings:
- Set Value for null field to \N so Redshift COPY can distinguish empty strings from NULL.
- For large datasets, split the document into smaller files. Redshift loads files in parallel, and smaller files transfer faster.
Step 3. Create the flow
Open Flows, click +, type redshift in the gallery search, and pick the flow type that matches your source.
Add the source-to-destination transformation and continue with the steps below.
Step 4. Set the Redshift connection on the flow
Open the Connections tab and pick the Redshift connection from Step 1.
Step 5. Set source (FROM) and destination (TO)
FROM (varies by flow type):
- Database: table or view name. Wildcards supported, e.g. public.*.
- Well-known API: endpoint or file name.
- Web service: endpoint name.
- File: file name or wildcard pattern.
- Queue: queue name.
TO: the fully qualified Redshift table name. If FROM uses a wildcard, use a matching wildcard in TO.
The S3 connection is the destination connection on the transformation. Etlworks treats it as the Redshift stage.
Step 6. Set the CSV format
Select the CSV format you created in Step 2 as the destination format.
Step 7. Set parameters (optional)
Click MAPPING on the transformation row, then go to the Parameters tab. The parameters are grouped below by purpose.
COPY behavior
| Parameter | What it does |
|---|---|
| Redshift Table Name | Overrides the destination table name set at the transformation level. |
| Column List | Comma-separated list of target columns. Lets you load source fields into specific target columns. Columns can be in any order in the COPY statement, but when loading flat files (e.g. from S3), the source order must match this list. |
| IAM ROLE | Optional AWS IAM role used by Redshift to read the S3 stage. If empty, Etlworks uses the access key / secret from the S3 connection. |
| Action | COPY INTO inserts new rows; MERGE upserts based on lookup fields. |
| How to MERGE | DELETE/INSERT (default): deletes matching rows in the target then inserts everything from the temp table. Native MERGE (preview): uses Redshift's native MERGE SQL (currently in AWS preview). |
| Lookup Fields | Comma-separated list of columns that uniquely identify a record. Required when Action is MERGE. |
| Predict Lookup Fields | Auto-detect the unique-key columns when Lookup Fields is empty. Detection may not always be accurate. |
| COPY SQL | User-defined Redshift COPY SQL. Etlworks generates one automatically by default. |
| MERGE SQL | User-defined MERGE SQL. Available tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}. |
| CDC MERGE SQL | User-defined CDC-MERGE SQL. Available tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}, {UPDATE_FIELDS}, {UPDATE_CONDITIONS}. |
CSV parsing and NULL handling
| Parameter | What it does |
|---|---|
| File Format | Optional inline CSV format definition. See Redshift file formats. If empty, Etlworks generates one from the destination format. |
| String used to convert to SQL NULL | Maps a source string to SQL NULL on load (Redshift's NULL AS). Default is \N. Use '\0' or '\000' to map the NUL byte. Don't use '\n' — Redshift reserves it as the line delimiter. |
| Load empty chars as null | Loads empty CHAR / VARCHAR fields (two delimiters with nothing between them) as NULL. Numeric and other types are always loaded as NULL when empty. |
| Load blank chars as null | Loads whitespace-only CHAR / VARCHAR fields (e.g. three spaces) as NULL. |
| Accept any date | Allows malformed dates (including 00/00/00 00:00:00) to load into TIMESTAMP / DATE columns without error. |
| Accept invalid UTF8 chars | Replaces invalid UTF-8 with the replacement char (default ?) instead of failing. |
| Escape | Treats backslash as an escape character: the next character is loaded literally, even if it's the delimiter, a quote, a newline, or another backslash. |
| Remove quotes | Strips surrounding quotation marks from string values. Characters inside the quotes (including delimiters) are preserved. |
| Explicit IDs | Use EXPLICIT_IDS on tables with IDENTITY columns to load explicit values from the source instead of auto-generated ones. If Column List is set, it must include the IDENTITY columns. |
| Truncate Columns | Truncates strings to fit the target column length instead of failing. |
Load options and performance
| Parameter | What it does |
|---|---|
| Refresh of Optimizer Statistics | STATUPDATE behavior. Default: stats are refreshed only when the table was initially empty. ON: always refresh after COPY. OFF: never refresh. |
| No Load | NOLOAD. Validates files without loading them. Much faster than a real load — useful as a pre-check. |
| Max # of Errors to Ignore | MAXERROR. 0 (default, strict) to 100,000. Lets the load continue past row-level errors. Set to 0 or 1 to fail on the first bad row. |
ELT — SQL before and after COPY
| Parameter | What it does |
|---|---|
| Before COPY SQL | SQL to run on the Redshift connection before the COPY. |
| Ignore errors for Before COPY SQL | Lets the flow continue if the before-SQL fails. |
| Before COPY SQL is a script | Treats the field as a full Redshift script, not a single statement. |
| After COPY SQL / Ignore errors / is a script | Same three options, but run after the COPY. |
File cleanup
| Parameter | What it does |
|---|---|
| Purge File if Success | On by default (when not using user-defined SQL). Deletes the staged file after a successful COPY. |
| Purge File(s) if Error | On by default. Deletes staged files on COPY failure. |
| Name(s) of the File(s) to Purge | Files to delete on error. Wildcards supported. Defaults to the wildcard used in COPY. |
Debugging
| Parameter | What it does |
|---|---|
| Log each executed SQL statement | Logs every auto-generated and executed SQL statement, including before-/after-SQL. |
Source-specific parameters
Some parameters only apply to certain source types:
- Database source: see Extract, Transform, and Load (ETL) data for source-database options.
-
File or web-service source:
- Ignore Transformations: copies files directly to the stage without applying transformations. Much faster, but you lose any transformation steps.
- Delete Loaded Source Files: deletes the source file after a successful load.
- Delete Source Files on Error: deletes the source file if the load fails.
- Well-known API source: see the well-known API options.
Step 8. Configure MERGE (UPSERT) (optional)
To upsert instead of insert:
- Set Action to MERGE.
- Provide Lookup Fields — a comma-separated list of columns that uniquely identify a record. Or enable Predict Lookup Fields for auto-detection (less reliable).
For wildcard sources (see Load multiple tables with a wildcard) that target multiple tables, specify per-table lookup fields with fully qualified names, separated by semicolons:
test1.inventory=inventory_id,database_name; test1.payment=payment_id,database_name; test1.rental=rental_id,database_name;
Step 9. Handle source / destination schema differences
The source and the Redshift target often have a different column count or order. Redshift's COPY fails when columns don't line up. Etlworks gives you several ways to deal with this:
| Option | What it does |
|---|---|
| Reorder CSV Columns During Extract | Reorders columns in the extracted file to match the Redshift target. Ignored if the target table doesn't exist. |
| Ignore Fields in Source Not in Target | Drops source columns the target doesn't have. Mutually exclusive with the next two. |
| Alter Target Table If Source Has Extra Columns | Adds the extra columns to the Redshift table. Mutually exclusive with the previous and next. |
| Recreate Target Table If Source Has Extra Columns | Drops and recreates the target table. Destructive. Mutually exclusive with the previous two. |
| Insert NULL Into Target Fields Not in Source | Inserts NULL for any target column the source doesn't have. |
| Create Table SQL | Custom SQL or JavaScript template for creating the target table. See overriding the create-table SQL. |
Notifications on schema mismatch
If the source and target columns don't match, the flow logs an exception. You can wire that exception into an email notification — see sending notifications about failed transformations. The standard exception handler is not required for this case — the schema-mismatch exception is captured in the flow log automatically.
Step 10. Configure mapping (optional)
If you need explicit field-to-field mapping, set it up in Mapping. Mapping is not required; auto-mapping by field name works for most flows.
Redshift has restrictions on field names — if the source contains characters Redshift doesn't accept, COPY fails. For example, Google Analytics output uses ga:user, ga:browser, etc., and Redshift won't accept the colon. Use Mapping to rename those fields before loading.
Incremental change replication (high watermark)
Like any other ETL flow, this flow supports change replication using a high watermark (HWM). Only records that changed since the last run are loaded.
To enable it:
- Set the High Watermark Field — the column that tracks changes.
- Enable Change Replication on the transformation.
- Optional: set a calculated HWM value for clearer tracking.
Load multiple tables with a wildcard
You can load many database objects in a single transformation by using a wildcard in FROM.
FROM
- Create a single source-to-destination transformation.
- Set FROM to a fully qualified source object name with wildcards, e.g. public.*.
By default every table and view matching the pattern is included.
TO
Set TO to a Redshift schema with a wildcard, e.g. public.*.
Include / exclude objects
- Exclude Objects: comma-separated list of objects to skip. Use all views to skip every view; all tables to skip every table.
- Include Objects: comma-separated list of objects to keep (everything else is skipped).
Source query, before/after SQL
Use the {TABLE} token in Source Query, Before COPY SQL, and After COPY SQL to reference each table dynamically.
MERGE with wildcards
When MERGE is the action:
- Leave Lookup Field empty and enable Predict Lookup Fields, or
- Specify per-table lookup fields as semicolon-separated fully_qualified_table=field_list pairs:
test1.inventory=inventory_id,database_name; test1.payment=payment_id,database_name; test1.rental=rental_id,database_name;
HWM with wildcards
For incremental replication with a calculated HWM field, use the {TABLE} token in the High Watermark Field Value.
Handle Unicode strings (no NVARCHAR in Redshift)
Redshift has no NVARCHAR type, and VARCHAR(1) is 1 byte rather than 1 character. Unicode strings can be truncated when Etlworks auto-creates the Redshift table from the source.
To compensate, set Varchar Field Size Multiplier on the Redshift connection. The multiplier is applied to the default VARCHAR-like column sizes (except CHAR) when Etlworks creates the destination table. Even with the multiplier, the maximum column length is capped at 65,535.
Flow variables
Open the Parameters tab and add user-defined flow variables as key-value pairs. Use uppercase names without spaces.
Maximum number of parallel threads
Multiple source-to-destination transformations (extract-load) can run in parallel threads. The default limit is 5; the maximum is 99. Change it via Maximum Number of Parallel Threads on the Parameters tab.
Troubleshooting
See Common issues when loading data into cloud data warehouses for common errors and fixes.