When to use this Flow type
If you already have CSV files in an attached file system (server storage) and don't need to transform the data, the most efficient way to load these files into Greenplum is to use the Flow type Bulk load files into Greenplum described in this article.
Use this Flow type:
- When you simply want to load CSV files that already exist in the server storage without applying any transformations or transferring files over the network.
- When you want to load CSV files created by the CDC Flow.
Flows optimized for Greenplum
| Flow type | | When to use |
| --- | --- | --- |
| | | When you need to extract data from any source, transform it, and load it into Greenplum. |
| Bulk load files into Greenplum | You are here | When you need to bulk-load files that already exist in server storage without applying any transformations. The Flow automatically loads data into staging tables and MERGEs data into the destination. |
| Stream CDC events into Greenplum | | When you need to stream updates from a database that supports Change Data Capture (CDC) into Greenplum in real time. |
| Stream messages from a queue into Greenplum | | When you need to stream messages from a message queue that supports streaming into Greenplum in real time. |
How it works
- The Flow reads the names of all files matching the wildcard in the specified server storage location. It can traverse subfolders as well.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads, one per destination table, but not more than the configurable threshold.
- The Flow creates a control file and executes the Greenplum gpload utility to load data into a staging table in Greenplum, one staging table per actual destination table.
- The Flow merges data in the staging tables with data in the actual tables.
Features
The Flow supports INSERT, MERGE (UPSERT), and CDC MERGE. With CDC MERGE, the Flow updates the actual destination table by applying INSERT/UPDATE/DELETE events in the same order as they originated in the source database, in the most efficient way possible.
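For illustration only, here is a minimal SQL sketch of what the merge step conceptually does, assuming a hypothetical staging table public.orders_stage, a hypothetical destination table public.orders, and order_id as the lookup field. The actual statements are generated by the Flow and may differ.

```sql
-- Update destination rows that already exist, matching staging rows on the lookup field(s)
UPDATE public.orders t
SET    amount = s.amount,
       status = s.status
FROM   public.orders_stage s
WHERE  t.order_id = s.order_id;

-- Insert rows that are present in staging but not yet in the destination
INSERT INTO public.orders (order_id, amount, status)
SELECT s.order_id, s.amount, s.status
FROM   public.orders_stage s
WHERE  NOT EXISTS (
  SELECT 1 FROM public.orders t WHERE t.order_id = s.order_id
);

-- With CDC MERGE, DELETE events would additionally remove matching rows,
-- for example (cdc_op is a hypothetical operation-marker column):
-- DELETE FROM public.orders t USING public.orders_stage s
-- WHERE t.order_id = s.order_id AND s.cdc_op = 'D';
```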
| Other Features | Description |
| --- | --- |
| Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source. |
| Load CSV files | The Flow can load CSV files into Greenplum using the gpload utility. |
| Delete loaded source files | The Flow can be configured to delete successfully loaded source files. |
| Load data in parallel | The Flow is able to load data into multiple destination tables in parallel. |
| Process files in subfolders | The Flow is able to process files in multiple nested subfolders. |
Prerequisites
- The Greenplum instance must be accessible from the Internet.
- The gpload utility must be installed on the same VM as Etlworks. Contact Etlworks support at support@etlworks.com if you need assistance installing gpload.

Install and configure Greenplum gpload utility
Read how to install and configure the Greenplum gpload utility and the command used to execute gpload.
Process
Step 1. Create a new Server Storage Connection
This Connection will be used as a source to access the data files. Read more about the Server Storage connector.
Step 2. Create a new Greenplum Connection
This Greenplum Connection will be used as a destination.
Step 3. Create a new CSV Format or Parquet Format
This Format will be used to generate a control file for the Greenplum gpload utility.
Step 4. Create a new Flow
This Flow will be used to load files in server storage into Greenplum.
In Flows, click [+], type in bulk load files into greenplum, and select the Flow Bulk load files into Greenplum.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- The Server Storage Connection created in step 1.
- The CSV Format created in step 3.
- A wildcard filename that matches the filenames in the server storage.
- The Greenplum Connection created in step 2.
- The wildcard destination table name in the following format: schema.prefix_*_suffix, where schema is a Greenplum schema to load data into. Example: public.*. You don't need to use wildcards when loading into a specific table.
Step 6. Set optional and required parameters
Click MAPPING.
Select the Parameters tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- Include files in subfolders: if this option is enabled, the Flow will load all files that match the wildcard filename (set in FROM) in all subfolders under the root folder.
- Exclude files and Include files: the optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in FROM:
  - The Flow populates the list of files matching the wildcard filename in FROM.
  - The Flow excludes files based on the list of exclusions set in Exclude files.
  - The Flow includes files based on the list of inclusions set in Include files.
- Calculate Greenplum table name: this is an optional parameter used to calculate (using JavaScript) the destination table name based on the source file name. The original file name, without the extension, is stored in the variable name, and the actual table name must be assigned to the variable value. For example, let's assume that the source file name is db_schema-table_cdc_stream_uuid.csv. The JavaScript code to calculate the destination table name would be value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));. IMPORTANT: the Flow will automatically recognize and transform filenames matching the following template: *_cdc_stream_*.*. If filenames match this template, there is no need to provide a transformation script.
- Maximum Number of Files to Process: the optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there will be hundreds of thousands or millions (10^6 - 10^7) of files in the source folder(s), it is recommended to limit the number of files to process to a more manageable number: hundreds to thousands (10^3 - 10^5).
- Maximum Number of Parallel Loads: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel, one thread per table, but not more than the threshold set in this parameter.
- Parallel: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- Purge File if Success: if this option is enabled (it is by default), the source files will be automatically deleted after the data is successfully loaded.
Debug
- Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including the before/after LOAD SQL.
Error recovery
- Continue loading data into other tables if Error: if this parameter is enabled (it is disabled by default) and there is an error when loading data into some table, the Flow will continue loading data into other tables. If configured, a notification will be sent to the webhook.
Load parameters
- Action: can be INSERT, which inserts records from the file(s) into the Greenplum table; MERGE, which merges records from the file with the records in the Greenplum table; or CDC MERGE, which loads CDC events (INSERT/UPDATE/DELETE) in the same order as they originated in the source database. MERGE and CDC MERGE require configuring the Lookup Fields and/or enabling the option to predict the lookup fields (see below).
- Lookup Fields: the comma-separated list of fields that uniquely identify the record in the target Greenplum table.
- If the option to predict the lookup fields is enabled and Lookup Fields is empty, the system will try to predict the columns that uniquely identify the record in the target Greenplum table.
Hardcoded gpload parameters
- Content of the Control File: optional hardcoded gpload parameters. By default, the Flow will generate a control file automatically. The following tokens can be used as part of the control file:
  - {FILE}: the data file name
  - {TABLE}: the destination table name
  - {database}: the Greenplum database name
  - {host}: the Greenplum hostname
  - {port}: the Greenplum port number
  - {user}: the Greenplum user name
  - {password}: the Greenplum password
  - {COLUMNS}: the destination table columns
  - {MATCH_COLUMNS}: the lookup columns
  - {UPDATE_COLUMNS}: the columns to update
Parameters for automatically generated gpload control file
- Format: specifies the Format of the source data file(s): either plain text (TEXT) or comma-separated values (CSV). Defaults to TEXT if not specified. For more information about the Format of the source data, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
- String that represents NULL values: specifies the string that represents a NULL value. The default is backslash-N in TEXT mode and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.
- Force NOT NULL: in CSV mode, processes each specified column as though it were quoted and hence not a NULL value. The default null string in CSV mode (nothing between two delimiters) causes missing values to be evaluated as zero-length strings.
- Encoding: character set encoding of the source data. Specify a string constant (such as SQL_ASCII), an integer encoding number, or DEFAULT to use the default client encoding. If not specified, the default client encoding is used. For information about supported character sets, see the Greenplum Database Reference Guide.
- Max # of Errors to Ignore: enables single-row error isolation mode for this load operation. When enabled, input rows that have Format errors will be discarded, provided that the error limit count is not reached on any Greenplum Database segment instance during input processing. If the error limit is not reached, all good rows will be loaded, and any error rows will either be discarded or captured as part of error log information. The default is to abort the load operation on the first error encountered. Note that single-row error isolation only applies to data rows with Format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. For information about handling load errors, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
- Log Errors: the value is either true or false. The default value is false. If the value is true, rows with formatting errors are logged internally when running in single-row error isolation mode. You can examine formatting errors with the Greenplum Database built-in SQL function gp_read_error_log('table_name') (see the example query after this list). If formatting errors are detected when loading data, gpload generates a warning message with the name of the table that contains the error information.
- Truncate prior to loading: if set to true, gpload will remove all rows in the target table prior to loading it.
- Reuse Tables: if set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table).
- Fully qualified domain name: specifies whether gpload resolves hostnames to the fully qualified domain name (FQDN) or the local hostname. If the value is set to true, names are resolved to the FQDN. If the value is set to false, the resolution is to the local hostname. The default is false. A fully qualified domain name might be required in some situations, for example, if the Greenplum Database system is in a different domain than an ETL application being accessed by gpload.
- SSL: specifies usage of SSL encryption. If SSL is set to true, gpload starts the gpfdist server with the --ssl option and uses the gpfdists:// protocol.
- SSL Certificates Path: required when SSL is true; cannot be specified when SSL is false or unspecified. The location specified in CERTIFICATES_PATH must contain the following files:
  - The server certificate file: server.crt
  - The server private key file: server.key
  - The trusted certificate authorities: root.crt
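As a sketch of how load errors can be examined, assume a hypothetical destination table public.orders was loaded with Log Errors set to true. The error-log functions are built into Greenplum, but their output columns can vary between Greenplum versions.

```sql
-- Inspect rows rejected during single-row error isolation for this table
SELECT * FROM gp_read_error_log('public.orders');

-- Optionally clear the error log for the table once the errors have been reviewed
SELECT gp_truncate_error_log('public.orders');
```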
ELT
- Set variables for Before and After SQL: code in JavaScript to set the variables that can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable vars.
  - Example: vars.put('schema', 'abc');
  - Example of the SQL: insert into test (schema_name) values ('{schema}')
- Before LOAD SQL: this SQL will be executed on the Greenplum Connection before loading data. You can configure the Flow to ignore errors when executing this SQL and to execute the SQL as a script.
- Ignore errors when executing Before LOAD SQL: if this option is enabled and there is an error when Before LOAD SQL is executed, the error will be ignored.
- Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
- After LOAD SQL: this SQL will be executed on the Greenplum Connection after loading data. You can configure the Flow to ignore errors when executing this SQL and to execute the SQL as a script. See the example after this list.
- Ignore errors when executing After LOAD SQL: if this option is enabled and there is an error when After LOAD SQL is executed, the error will be ignored.
- After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
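For instance, a hypothetical After LOAD SQL script (with After LOAD SQL is a script enabled) could refresh statistics on the loaded table and record the load in an audit table, using the {schema} variable set in the JavaScript example above. The table and column names here are illustrative only.

```sql
-- Refresh optimizer statistics on the freshly loaded table
ANALYZE {schema}.orders;

-- Record the load in an illustrative audit table
INSERT INTO {schema}.load_audit (table_name, loaded_at)
VALUES ('orders', now());
```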
Handling source schema changes
- Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has fields that the target table does not, the system will add the extra fields to the target table, as in the sketch below.
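For example, if the source files gain a column discount that a hypothetical destination table public.orders does not have, the statement issued would look similar to the following (sketch only; the exact DDL and data type mapping are determined by the Flow).

```sql
-- Add the column found in the source but missing from the destination table
ALTER TABLE public.orders ADD COLUMN discount numeric;
```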
Step 7. Optionally, add a mapping.
The mapping can be used to:
- Rename a column.
- Exclude a column.
- Change the column data type for tables automatically created by the Flow.
Configure mapping
Select the transformation and click MAPPING.
Add a mapping for the column CURRENT_DATE.
Renaming or excluding a column in the mapping will work for all tables that have this column. If a table does not have the column, the mapping will be ignored.
Step 8. Optionally, add more source-to-destination transformations
Click [+] and repeat steps 5 to 6 for more locations and/or file name templates.