Use this flow to extract data from any supported source, transform it, and load it into Greenplum. Etlworks stages CSV files locally, generates a gpload control file, and runs gpload against your Greenplum cluster.
Which Greenplum flow should I use?
| Flow | Use when |
|---|---|
|
Any to Greenplum (this article) Database / File / Queue / Web service / Well-known API → Greenplum |
You need to extract from any source, optionally transform, and load into Greenplum. |
| Bulk load files into Greenplum | The files already exist in server storage. No transformation needed. Auto-loads into staging tables and MERGEs into the destination. |
| Stream CDC events into Greenplum | You need real-time replication from a CDC-enabled source database. |
| Streaming with message queues | You need real-time ingestion from a message queue that supports streaming. |
Greenplum is a massively parallel Postgres fork. Row-by-row inserts are very slow; the flows above use the gpload utility, which is the supported high-performance load path.
How does this flow work?
- Extracts data from the source.
- Writes CSV files to server storage.
- Generates a gpload control file and runs gpload to load the files into Greenplum.
- Cleans up staged files.
What do I need before I start?
- Greenplum is reachable from your Etlworks instance.
- The gpload utility is installed on the same VM as Etlworks. Contact support@etlworks.com if you need help installing it. Detailed install steps: Install and configure gpload.
Step-by-step setup
Step 1. Create connections
You need three connections:
- A source connection.
- A server storage connection for the stage.
- A Greenplum connection.
Step 2. Create a CSV format
Create a CSV format.
Step 3. Create the flow
Open Flows, click +, type greenplum in the gallery search, and pick the flow type that matches your source.
Add the source-to-destination transformation and continue with the steps below.
Step 4. Set the Greenplum connection on the flow
Open the Connections tab and pick the Greenplum connection from Step 1.
Step 5. Set source (FROM) and destination (TO)
FROM (varies by flow type):
- Database: table or view name. Wildcards supported, e.g. public.*.
- Well-known API: endpoint or file name.
- Web service: endpoint name.
- File: file name or wildcard pattern.
- Queue: queue name.
TO: the fully qualified Greenplum table name. If FROM uses a wildcard, use a matching wildcard in TO.
The server storage connection is the destination connection on the transformation. Etlworks treats it as the stage.
Step 6. Set the CSV format
Select the format you created in Step 2 as the destination format.
Step 7. Set parameters
Required parameters
| Parameter | What it does |
|---|---|
| Load Mode | INSERT loads new rows. MERGE inserts new rows and updates existing ones. UPDATE updates existing rows only. MERGE and UPDATE require lookup fields. |
| Lookup Fields | Comma-separated list of columns that uniquely identify a record. Required for MERGE and UPDATE. Or enable Predict Lookup Fields for auto-detection (less reliable). |
gpload control-file parameters
| Parameter | What it does |
|---|---|
| Content of the Control File | Hardcoded gpload parameters. Etlworks generates a control file automatically by default. Available tokens: {FILE}, {TABLE}, {database}, {host}, {port}, {user}, {password}, {COLUMNS}, {MATCH_COLUMNS}, {UPDATE_COLUMNS}. |
| Format | Source data format: TEXT (default) or CSV. See the Greenplum admin guide for details. |
| String that represents NULL values | Default: \N in TEXT mode, empty string in CSV mode. Any source value matching this string is loaded as NULL. |
| Force NOT NULL | CSV mode. Processes the specified columns as if quoted, so missing values aren't treated as NULL. |
| Encoding | Source data encoding (e.g. SQL_ASCII, an encoding number, or DEFAULT). |
| Max # of Errors to Ignore | Enables single-row error isolation. Format-error rows are discarded until this limit. Constraint errors (e.g. PK violations) still abort the load. |
| Log Errors | true / false. With single-row error isolation, logs format-error rows internally. Read them via gp_read_error_log('table_name'). |
| Truncate prior to loading | If true, gpload truncates the target table before loading. |
| Reuse Tables | If true, gpload keeps the external and staging tables for reuse on subsequent loads. Improves performance for trickle loads. |
| Fully qualified domain name | If true, gpload resolves hostnames to FQDNs. Useful when Greenplum and Etlworks are in different domains. Default false. |
| SSL | If true, gpload starts gpfdist with --ssl and uses the gpfdists:// protocol. |
| SSL Certificates Path | Required when SSL is true. Must contain server.crt (server certificate), server.key (server private key), and root.crt (trusted CAs). |
File cleanup
| Parameter | What it does |
|---|---|
| Purge File if Success | On by default. Deletes the staged file after a successful load. |
| Purge File(s) if Error | On by default. Deletes staged files on load failure. |
ELT — SQL before and after load
| Parameter | What it does |
|---|---|
| Before LOAD SQL / Ignore errors / is a script | SQL to run on the Greenplum connection before the load. |
| After LOAD SQL / Ignore errors / is a script | Same, but after the load. |
Source-specific parameters
- Database source: see Extract, Transform, and Load (ETL) data.
-
File or web-service source:
- Ignore Transformations: copies files directly to server storage. Faster; transformations are skipped.
- Delete loaded source files: deletes the source file after a successful load.
- Delete source files on error: deletes the source file if the load fails.
- Well-known API source: see the well-known API options.
Step 8. Configure MERGE (UPSERT) (optional)
- Set Load Mode to MERGE.
- Provide Lookup Fields, or enable Predict Lookup Fields for auto-detection.
For wildcard sources targeting multiple tables, use semicolon-separated per-table lookup fields:
test1.inventory=inventory_id,database_name; test1.payment=payment_id,database_name; test1.rental=rental_id,database_name;
Step 9. Handle source / destination schema differences
| Option | What it does |
|---|---|
| Reorder CSV Columns During Extract | Reorders columns in the extracted file to match the target. Ignored if the target table doesn't exist. |
| Ignore Fields in Source Not in Target | Drops source columns the target doesn't have. |
| Alter Target Table If Source Has Extra Columns | Adds the extra columns to the Greenplum table. |
| Recreate Target Table If Source Has Extra Columns | Drops and recreates the target table. Destructive. |
| Insert NULL Into Target Fields Not in Source | Inserts NULL for any target column the source doesn't have. |
| Create Table SQL | Custom SQL or JavaScript template for creating the target table. See overriding the create-table SQL. |
Notifications on schema mismatch
If columns don't match, the flow logs an exception. See sending notifications about failed transformations.
Step 10. Configure mapping (optional)
If you need explicit field-to-field mapping, set it up in Mapping. Auto-mapping by field name works for most flows.
Incremental change replication (high watermark)
Change replication using HWM is supported. Set the High Watermark Field, enable Change Replication, and optionally configure a calculated HWM value.
Load multiple tables with a wildcard
FROM
Create a single source-to-destination transformation and set FROM to a wildcard pattern, e.g. public.*.
TO
Set TO to db.schema.*, e.g. dbo.history.*.
Include / exclude objects
- Exclude Objects: comma-separated list of objects to skip.
- Include Objects: comma-separated list of objects to keep.
Source query, before/after SQL
Use the {TABLE} token in Source Query, Before LOAD SQL, and After LOAD SQL.
MERGE with wildcards
Leave Lookup Fields empty and enable Predict Lookup Fields, or specify per-table lookup fields as semicolon-separated pairs.
HWM with wildcards
Use the {TABLE} token in the High Watermark Field Value.
Troubleshooting
See Common issues when loading data into cloud data warehouses for common errors and fixes.