Greenplum is an open-source massively parallel data platform for analytics, machine learning, and AI.
It is a fork of Postgres, so you can use the same techniques you would normally use to work with relational databases in Etlworks. However, it is important to understand that inserting data into Greenplum row by row can be painfully slow.
It is recommended that you use Flows optimized to ETL data into Greenplum.
Use this Flow type when you need to bulk-load files that already exist in server storage without applying any transformations. The Flow automatically loads data into staging tables and MERGEs data into the destination.
The Flow performs the following steps:
Creates a control file and executes the Greenplum gpload utility to load data into the Greenplum table.
Cleans up the remaining files, if needed.
Prerequisites
The Greenplum instance must be available from the Internet.
The gpload utility must be installed on the same VM as Etlworks. Contact Etlworks support at support@etlworks.com if you need assistance installing the gpload utility.
Install and configure Greenplum gpload utility
Read how to install and configure the Greenplum gpload utility and the command used to execute gpload.
To configure the final destination, click the Connections tab, and select the available Greenplum Connection.
Step 5. Set the source and the destination
Depending upon the Flow type, you can select one of the following sources (FROM) for the Greenplum Flow:
Database: use the database table or view name as the source (FROM) name. The wildcard names, for example, public.*, are supported.
Well-known API: use the endpoint name or the file name as the source (FROM) name.
Web Service: use the endpoint name as the source (FROM) name.
File: use the source file name or a wildcard file name as the source (FROM) name.
Queue: use the queue name as the source (FROM) name.
For all Greenplum Flows, the destination Connection is going to be a Server Storage Connection. Etlworks uses the destination Connection as a stage.
Select or enter the fully qualified or wildcard Greenplum table name as a destination (TO). If the source (FROM) is entered using a wildcard template, set the destination (TO) to a wildcard as well.
Step 6. Set the data exchange Format configured in Step 2
Step 7. Set the optional parameters
Click MAPPING in the transformation row and select the Parameters tab.
Required parameters:
Load Mode: there are three available load modes:
INSERT: loads data into the target table.
MERGE: inserts new rows and updates the existing rows.
UPDATE: updates the existing rows. UPDATE and MERGE require a comma-separated list of Lookup Fields.
Lookup Fields: the comma-separated list of fields that uniquely identify the record in the Greenplum table. This field is required if the action is set to MERGE or UPDATE. Alternatively, you can enable the Predict Lookup Fields option, which will force the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to detect the unique fields correctly.
Optional parameters and transformations:
Content of the Control File: optional hardcoded gpload parameters. By default, the Flow will generate a control file automatically. The following tokens can be used as a part of the control file:
{FILE}: the data file name
{TABLE}: the destination table name
{database}: the Greenplum database name
{host}: the Greenplum hostname
{port}: the Greenplum port number
{user}: the Greenplum user name
{password}: the Greenplum password
{COLUMNS}: the destination table columns
{MATCH_COLUMNS}: the lookup columns
{UPDATE_COLUMNS}: the columns to update
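For illustration, the sketch below shows what a gpload control file for a MERGE load using the tokens above might look like. This is only an approximation, not the exact template the Flow generates; see the gpload documentation for the full control file format.
    # Illustrative sketch only: tokens in {curly braces} are replaced by the Flow at runtime
    VERSION: 1.0.0.1
    DATABASE: {database}
    USER: {user}
    PASSWORD: {password}
    HOST: {host}
    PORT: {port}
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - {FILE}
        - COLUMNS:
            {COLUMNS}
        - FORMAT: csv
      OUTPUT:
        - TABLE: {TABLE}
        - MODE: merge
        - MATCH_COLUMNS:
            {MATCH_COLUMNS}
        - UPDATE_COLUMNS:
            {UPDATE_COLUMNS}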
Format: specifies the Format of the source data file(s), either plain text (TEXT) or comma-separated values (CSV). Defaults to TEXT if not specified. For more information about the Format of the source data, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
String that represents NULL values: specifies the string that represents a NULL value. The default is backslash-N in TEXT mode and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.
Force NOT NULL: in CSV mode, processes each specified column as though it were quoted and hence not a NULL value. The default null string in CSV mode (nothing between two delimiters) causes missing values to be evaluated as zero-length strings.
Encoding: character set encoding of the source data. Specify a string constant (such as SQL_ASCII), an integer encoding number, or DEFAULT to use the default client encoding. If not specified, the default client encoding is used. For information about supported character sets, see the Greenplum Database Reference Guide.
Max # of Errors to Ignore: enables single row error isolation mode for this load operation. When enabled, input rows that have Format errors will be discarded provided that the error limit count is not reached on any Greenplum Database segment instance during input processing. If the error limit is not reached, all good rows will be loaded, and any error rows will either be discarded or captured as part of error log information. The default is to abort the load operation on the first error encountered. Note that single row error isolation only applies to data rows with Format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. For information about handling load errors, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
Log Errors: value is either true or false. The default value is false. If the value is true, rows with formatting errors are logged internally when running in single row error isolation mode. You can examine formatting errors with the Greenplum Database built-in SQL function gp_read_error_log('table_name'). If formatting errors are detected when loading data, gpload generates a warning message with the name of the table that contains the error information.
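For example, if gpload reports that formatting errors were logged for a hypothetical table public.sales, you can review them with a query like this:
    -- public.sales is a placeholder; use the table name from the gpload warning message
    SELECT * FROM gp_read_error_log('public.sales');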
Truncate prior to loading: if set to true, gpload will remove all rows in the target table prior to loading it.
Reuse Tables: if set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table).
Purge File if Success: if this option is enabled (it is by default) and the user-defined SQL is not used, Etlworks will automatically delete the staging file after the load command is successfully executed.
Purge File(s) if Error: if this option is enabled (by default), the staging files will be automatically deleted if a LOAD command was executed with an error.
Fully qualified domain name: specifies whether gpload resolves hostnames to the fully qualified domain name (FQDN) or the local hostname. If the value is set to true, names are resolved to the FQDN. If the value is set to false, the resolution is to the local hostname. The default is false. A fully qualified domain name might be required in some situations. For example, if the Greenplum Database system is in a different domain than an ETL application that is being accessed by gpload.
SSL: specifies usage of SSL encryption. If SSL is set to true, gpload starts the gpfdist server with the --ssl option and uses the gpfdists:// protocol.
SSL Certificates Path: required when SSL is true; cannot be specified when SSL is false or unspecified. The location specified in CERTIFICATES_PATH must contain the following files:
The server certificate file: server.crt
The server private key file: server.key
The trusted certificate authorities: root.crt
ELT
Before LOAD SQL: this SQL will be executed on the Greenplum Connection before loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing Before LOAD SQL: if this option is enabled, and there is an error when Before LOAD SQL is executed, the error will be ignored.
Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
After LOAD SQL: this SQL will be executed on the Greenplum Connection after loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing After LOAD SQL: if this option is enabled, and there is an error when After LOAD SQL is executed, the error will be ignored.
After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
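For example (with hypothetical table names), Before LOAD SQL could prepare the target table and After LOAD SQL could refresh statistics after the load; when the corresponding 'is a script' option is enabled, multiple statements separated by semicolons can be used:
    -- Before LOAD SQL (hypothetical example): empty the target table before loading
    TRUNCATE TABLE public.orders;

    -- After LOAD SQL (hypothetical example): refresh optimizer statistics on the loaded table
    ANALYZE public.orders;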
Parameters specific to different source types
Depending on the Flow type, other Flow parameters can be added, as explained below:
Ignore Transformations: if this parameter is enabled, the system will copy files directly into the server storage before loading data into the actual table. This greatly improves the performance of the load but automatically disables any transformations.
Delete loaded source files: if this parameter is enabled, the system will automatically delete loaded files from the source.
Delete source files on error: if this parameter is enabled and there is an error during the load, the system will automatically delete loaded files from the source.
To MERGE (UPSERT) existing data in the Greenplum table with new data:
Set the Action to MERGE.
Define the Lookup Fields, the comma-separated list of fields that uniquely identify the record in the Greenplum table.
Alternatively, you can enable the Predict Lookup Fields which will force the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to detect the unique fields correctly.
When enabling the Predict Lookup Fields (which is not always accurate) is not an option, you can specify the list of table=fields pairs in the Lookup Fields. Use the fully-qualified table names and ; as a separator between table=fields pairs.
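For example, with hypothetical table and column names, the Lookup Fields value could look like this:
    public.orders=order_id;public.order_items=order_id,line_number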
Step 9. Configure how to handle source and destination schema changes
It is typical when the source (for example, the table in the OLTP database) and the destination (Greenplum table) have a different number of columns. It is also entirely possible that the order of columns in the source is different from that in the destination.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
Reorder CSV Columns During Extract: if this option is enabled (it is disabled by default), the Flow reorders columns in the data file to match the order of columns in the target table. This option is ignored if the target table does not exist.
Ignore fields in the source that do not exist in the target table: if this option is enabled (it is disabled by default) and the source has more fields than the target table, the system will automatically exclude extra fields. Options 2, 3, and 4 are mutually exclusive.
Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will add extra fields to the target table. Options 2, 3, and 4 are mutually exclusive.
Recreate target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will automatically drop and recreate the target table. Options 2, 3, and 4 are mutually exclusive.
Insert null into the fields in the target table that do not exist in the source: if this option is enabled (it is disabled by default) and the target table has different fields than the source, the system will automatically insert NULL values into these fields in the target.
Create table SQL: by default, the system automatically generates the SQL for creating a new table if it does not exist. You can override the default SQL by providing a template or JavaScript code. Read more about how to override the create table SQL.
Configure notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure Flow to send a notification when either source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
The only difference is that you don't need to configure the exception handler. If the system detects that the source and the destination have a different number of columns, it will add an exception to the log, which can be retrieved and sent in the email body, as explained above.
Step 10. Optionally configure Mapping
If necessary, you can create a Mapping between the source and destination (Greenplum) fields.
You create the Mapping between the source and destination just like you usually do for any other Flow type.
When change replication is enabled, only the changed records will be loaded into Greenplum.
Basically, all you need to do is set the High Watermark Field and Enable Change Replication for the transformation. Additionally, for better reliability, you can set the calculated High Watermark Field Value.
Load multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations), follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully qualified source database object name with wildcard characters (* and ?), for example, public.*
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude Objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include Objects field.
TO
Set the TO to db.schema.*, for example, dbo.history.*.
Optionally, configure the Source query. You can use token {TABLE} in the Source query.
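For example, a hypothetical Source query that extracts only recent rows from whichever table is currently being processed (updated_at is an illustrative column name):
    -- {TABLE} is replaced with the current source table name at runtime
    SELECT * FROM {TABLE} WHERE updated_at >= current_date - 7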
BEFORE/AFTER SQL
If you want to execute any code before and/or after LOAD, use the token {TABLE} in the Before LOAD SQL and After LOAD SQL.
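For example, an After LOAD SQL that refreshes optimizer statistics on each loaded table:
    -- {TABLE} is replaced with the current destination table name at runtime
    ANALYZE {TABLE};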
MERGE
If you are configuring the MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
An alternative to enabling the Predict Lookup Fields option (which is not always accurate) is to specify the list of table=fields pairs in the Lookup Fields. Use the fully-qualified table names and ; as a separator between table=fields pairs.
If you are setting up the high watermark change replication with a calculated HWM field, use the token {TABLE} in the High Watermark Field Value.
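For example, assuming the High Watermark Field Value accepts a SQL query and the high watermark column is named updated_at (both are illustrative assumptions; check the change replication documentation for the exact syntax), the calculated value could look like this:
    -- updated_at is a hypothetical high watermark column
    SELECT MAX(updated_at) FROM {TABLE}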
Flow variables
To add user-defined Flow variables, open the Parameters tab and add variables as key-value pairs. Use uppercase characters and no spaces for variable names.
Maximum number of parallel threads
It is possible to execute multiple source-to-destination transformations (extract-load) in parallel threads.
By default, Etlworks limits the number of threads to 5. To change the limit, open the Parameters tab and set the maximum number of threads in Maximum Number of Parallel Threads. This value cannot be greater than 99.