Greenplum is an open-source massively parallel data platform for analytics, machine learning, and AI.
It is a fork of Postgres, so you can use the same techniques you would normally use to work with relational databases in Etlworks. However, it is important to understand that inserting data into Greenplum row by row can be painfully slow.
It is recommended that you use Flows optimized to ETL data into Greenplum.
Use this Flow type when you need to bulk-load files that already exist in server storage without applying any transformations. The Flow automatically loads data into staging tables and MERGEs data into the destination.
The Flow performs the following steps:
Creates a control file and executes the Greenplum gpload utility to load data into the Greenplum table.
Cleans up the remaining files, if needed.
Prerequisites
The Greenplum instance must be available from the Internet.
The gpload utility must be installed on the same VM as Etlworks. Contact Etlworks support at support@etlworks.com if you need assistance installing the gpload utility.
Install and configure Greenplum gpload utility
Read how to install and configure the Greenplum gpload utility and the command used to execute gpload.
To configure the final destination, click the Connections tab, and select the available Greenplum Connection.
Step 5. Set the source and the destination
Depending upon the Flow type, you can select one of the following sources (FROM) for the Greenplum Flow:
Database: use the database table or view name as the source (FROM) name. The wildcard names, for example, public.*, are supported.
Well-known API: use the endpoint name or the file name as the source (FROM) name.
Web Service: use the endpoint name as the source (FROM) name.
File: use the source file name or a wildcard file name as the source (FROM) name.
Queue: use the queue name as the source (FROM) name.
For all Greenplum Flows, the destination Connection is going to be a Server Storage Connection. Etlworks uses the destination Connection as a stage.
Select or enter the fully qualified or wildcard Greenplum table name as a destination (TO). If the source (FROM) is entered using a wildcard template, set the destination (TO) to a wildcard as well.
Step 6. Set the data exchange Format configured in Step 2
Step 7. Set the optional parameters
Click MAPPING in the transformation row and select the Parameters tab.
Required parameters:
Load Mode: there are three available load modes:
INSERT: loads data into the target table.
MERGE: inserts new rows and updates the existing rows.
UPDATE: updates the existing rows. UPDATE and MERGE require a comma-separated list of Lookup Fields.
Lookup Fields: the comma-separated list of fields that uniquely identify the record in the Greenplum table. This field is required if the action is set to MERGE or UPDATE. Alternatively, you can enable the Predict Lookup Fields option, which will force the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to detect the unique fields correctly.
Optional parameters and transformations:
Content of the Control File: optional hardcoded gpload parameters. By default, the Flow will generate a control file automatically. The following tokens can be used as a part of the control file:
{FILE}: the data file name
{TABLE}: the destination table name
{database}: the Greenplum database name
{host}: the Greenplum hostname
{port}: the Greenplum port number
{user}: the Greenplum user name
{password}: the Greenplum password
{COLUMNS}: the destination table columns
{MATCH_COLUMNS}: the lookup columns
{UPDATE_COLUMNS}: the columns to update
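For illustration, the sketch below shows what a gpload control file for a MERGE load using the tokens above might look like. This is only an approximation, not the exact template the Flow generates; see the gpload documentation for the full control file format.
    # Illustrative sketch only: tokens in {curly braces} are replaced by the Flow at runtime
    VERSION: 1.0.0.1
    DATABASE: {database}
    USER: {user}
    PASSWORD: {password}
    HOST: {host}
    PORT: {port}
    GPLOAD:
      INPUT:
        - SOURCE:
            FILE:
              - {FILE}
        - COLUMNS:
            {COLUMNS}
        - FORMAT: csv
      OUTPUT:
        - TABLE: {TABLE}
        - MODE: merge
        - MATCH_COLUMNS:
            {MATCH_COLUMNS}
        - UPDATE_COLUMNS:
            {UPDATE_COLUMNS}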
Format: specifies the Format of the source data file(s), either plain text (TEXT) or comma-separated values (CSV). Defaults to TEXT if not specified. For more information about the Format of the source data, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
String that represents NULL values: specifies the string that represents a NULL value. The default is backslash-N in TEXT mode and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.
Force NOT NULL: in CSV mode, processes each specified column as though it were quoted and hence not a NULL value. The default null string in CSV mode (nothing between two delimiters) causes missing values to be evaluated as zero-length strings.
Encoding: character set encoding of the source data. Specify a string constant (such as SQL_ASCII), an integer encoding number, or DEFAULT to use the default client encoding. If not specified, the default client encoding is used. For information about supported character sets, see the Greenplum Database Reference Guide.
Max # of Errors to Ignore: enables single row error isolation mode for this load operation. When enabled, input rows that have Format errors will be discarded provided that the error limit count is not reached on any Greenplum Database segment instance during input processing. If the error limit is not reached, all good rows will be loaded, and any error rows will either be discarded or captured as part of error log information. The default is to abort the load operation on the first error encountered. Note that single row error isolation only applies to data rows with Format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. For information about handling load errors, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
Log Errors: value is either true or false. The default value is false. If the value is true, rows with formatting errors are logged internally when running in single row error isolation mode. You can examine formatting errors with the Greenplum Database built-in SQL function gp_read_error_log('table_name'). If formatting errors are detected when loading data, gpload generates a warning message with the name of the table that contains the error information.
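For example, if gpload reports that formatting errors were logged for a hypothetical table public.sales, you can review them with a query like this:
    -- public.sales is a placeholder; use the table name from the gpload warning message
    SELECT * FROM gp_read_error_log('public.sales');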
Truncate prior to loading: if set to true, gpload will remove all rows in the target table prior to loading it.
Reuse Tables: if set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table).
Purge File if Success: if this option is enabled (it is by default) and the user-defined SQL is not used, Etlworks will automatically delete the staging file after the load command is successfully executed.
Purge File(s) if Error: if this option is enabled (by default), the staging files will be automatically deleted if a LOAD command was executed with an error.
Fully qualified domain name: specifies whether gpload resolves hostnames to the fully qualified domain name (FQDN) or the local hostname. If the value is set to true, names are resolved to the FQDN. If the value is set to false, the resolution is to the local hostname. The default is false. A fully qualified domain name might be required in some situations. For example, if the Greenplum Database system is in a different domain than an ETL application that is being accessed by gpload.
SSL: specifies usage of SSL encryption. If SSL is set to true, gpload starts the gpfdist server with the --ssl option and uses the gpfdists:// protocol.
SSL Certificates Path: required when SSL is true; cannot be specified when SSL is false or unspecified. The location specified in CERTIFICATES_PATH must contain the following files:
The server certificate file: server.crt
The server private key file: server.key
The trusted certificate authorities: root.crt
ELT
Before LOAD SQL: this SQL will be executed on the Greenplum Connection before loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing Before LOAD SQL: if this option is enabled, and there is an error when Before LOAD SQL is executed, the error will be ignored.
Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
After LOAD SQL: this SQL will be executed on the Greenplum Connection after loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing After LOAD SQL: if this option is enabled, and there is an error when After LOAD SQL is executed, the error will be ignored.
After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
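For example (with hypothetical table names), Before LOAD SQL could prepare the target table and After LOAD SQL could refresh statistics after the load; when the corresponding 'is a script' option is enabled, multiple statements separated by semicolons can be used:
    -- Before LOAD SQL (hypothetical example): empty the target table before loading
    TRUNCATE TABLE public.orders;

    -- After LOAD SQL (hypothetical example): refresh optimizer statistics on the loaded table
    ANALYZE public.orders;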
Parameters specific to different source types
Depending on the Flow type, other Flow parameters can be added, as explained below:
Ignore Transformations: if this parameter is enabled, the system will copy files directly into the server storage before loading data into the actual table. This greatly improves the performance of the load but automatically disables any transformations.
Delete loaded source files: if this parameter is enabled, the system will automatically delete loaded files from the source.
Delete source files on error: if this parameter is enabled and there is an error during the load, the system will automatically delete loaded files from the source.
To MERGE (UPSERT) existing data in the Greenplum table with new data:
Set the Action to MERGE.
Define the Lookup Fields, the comma-separated list of fields that uniquely identify the record in the Greenplum table.
Alternatively, you can enable the Predict Lookup Fields which will force the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to detect the unique fields correctly.
When enabling the Predict Lookup Fields (which is not always accurate) is not an option, you can specify the list of table=fields pairs in the Lookup Fields. Use the fully-qualified table names and ; as a separator between table=fields pairs.
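For example, with hypothetical table and column names, the Lookup Fields value could look like this:
    public.orders=order_id;public.order_items=order_id,line_number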
Step 9. Configure how to handle source and destination schema changes
It is typical when the source (for example, the table in the OLTP database) and the destination (Greenplum table) have a different number of columns. It is also entirely possible that the order of columns in the source is different from that in the destination.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
Reorder CSV Columns During Extract: if this option is enabled (it is disabled by default), the Flow reorders columns in the data file to match the order of columns in the target table. This option is ignored if the target table does not exist.
Ignore fields in the source that do not exist in the target table: if this option is enabled (it is disabled by default) and the source has more fields than the target table, the system will automatically exclude extra fields. Options 2, 3, and 4 are mutually exclusive.
Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will add extra fields to the target table. Options 2, 3, and 4 are mutually exclusive.
Recreate target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will automatically drop and recreate the target table. Options 2, 3, and 4 are mutually exclusive.
Insert null into the fields in the target table that do not exist in the source: if this option is enabled (it is disabled by default) and the target table has different fields than the source, the system will automatically insert NULL values into these fields in the target.
Create table SQL: by default, the system automatically generates the SQL for creating a new table if it does not exist. You can override the default SQL by providing a template or JavaScript code. Read more about how to override the create table SQL.
Configure notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure Flow to send a notification when either source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
The only difference is that you don't need to configure the exception handler. If the system detects that the source and the destination have a different number of columns, it will add an exception to the log, which can be retrieved and sent in the email body, as explained above.
Step 10. Optionally configure Mapping
If necessary, you can create a Mapping between the source and destination (Greenplum) fields.
You create the Mapping between the source and destination just like you usually do for any other Flow type.
When change replication is enabled, only the changed records will be loaded into Greenplum.
Basically, all you need to do is set the High Watermark Field and Enable Change Replication for the transformation. Additionally, for better reliability, you can set the calculated High Watermark Field Value.
Load multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations), follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully qualified source database object name with wildcard characters (* and ?), for example, public.*
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude Objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include Objects field.
TO
Set the TO to db.schema.*, for example, dbo.history.*.
Optionally, configure the Source query. You can use token {TABLE} in the Source query.
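For example, a hypothetical Source query that extracts only recent rows from whichever table is currently being processed (updated_at is an illustrative column name):
    -- {TABLE} is replaced with the current source table name at runtime
    SELECT * FROM {TABLE} WHERE updated_at >= current_date - 7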
BEFORE/AFTER SQL
If you want to execute any code before and/or after LOAD, use the token {TABLE} in the Before LOAD SQL and After LOAD SQL.
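For example, an After LOAD SQL that refreshes optimizer statistics on each loaded table:
    -- {TABLE} is replaced with the current destination table name at runtime
    ANALYZE {TABLE};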
MERGE
If you are configuring the MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
An alternative to enabling the Predict Lookup Fields option (which is not always accurate) is to specify the list of table=fields pairs in the Lookup Fields. Use the fully-qualified table names and ; as a separator between table=fields pairs.
If you are setting up the high watermark change replication with a calculated HWM field, use the token {TABLE} in the High Watermark Field Value.
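For example, assuming the High Watermark Field Value accepts a SQL query and the high watermark column is named updated_at (both are illustrative assumptions; check the change replication documentation for the exact syntax), the calculated value could look like this:
    -- updated_at is a hypothetical high watermark column
    SELECT MAX(updated_at) FROM {TABLE}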
Flow variables
To add user-defined Flow variables, open the Parameters tab and add variables as key-value pairs. Use uppercase characters and no spaces for variable names.
Maximum number of parallel threads
It is possible to execute multiple source-to-destination transformations (extract-load) in parallel threads.
By default, Etlworks limits the number of threads to 5. To change the limit, open the Parameters tab and set the maximum number of threads in Maximum Number of Parallel Threads. This value cannot be greater than 99.