When to use this Flow type
If you already have CSV files in an attached file system (server storage) and don't need to transform the data, the most efficient way to load these files into Greenplum is to use the Flow type Bulk load files into Greenplum described in this article.
Use this Flow type:
- When you simply want to load CSV files that already exist in the server storage without applying any transformations or transferring files over the network.
- When you want to load CSV files created by the CDC Flow.
Flows optimized for Greenplum
| Flow type | | When to use |
| --- | --- | --- |
| | | When you need to extract data from any source, transform it, and load it into Greenplum. |
| Bulk load files into Greenplum | You are here | When you need to bulk-load files that already exist in server storage without applying any transformations. The Flow automatically loads data into staging tables and MERGEs data into the destination. |
| Stream CDC events into Greenplum | | When you need to stream updates from a database that supports Change Data Capture (CDC) into Greenplum in real time. |
| Stream messages from a queue into Greenplum | | When you need to stream messages from a message queue that supports streaming into Greenplum in real time. |
How it works
- The Flow reads the names of all files matching the wildcard in the specified server storage location. It can traverse subfolders as well.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads, one per destination table, but not more than the configurable threshold.
- The Flow creates a control file and executes the Greenplum gpload utility to load data into a staging table in Greenplum, one staging table per actual destination table.
- The Flow merges data in the staging tables with data in the actual tables.
Features
The Flow supports INSERT, MERGE (UPSERT), and CDC MERGE. With CDC MERGE, the Flow updates the actual destination table by applying INSERT/UPDATE/DELETE events in the same order as they originated in the source database, in the most efficient way possible.
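For illustration only, here is a minimal SQL sketch of what the merge step conceptually does, assuming a hypothetical staging table public.orders_stage, a hypothetical destination table public.orders, and order_id as the lookup field. The actual statements are generated by the Flow and may differ.

```sql
-- Update destination rows that already exist, matching staging rows on the lookup field(s)
UPDATE public.orders t
SET    amount = s.amount,
       status = s.status
FROM   public.orders_stage s
WHERE  t.order_id = s.order_id;

-- Insert rows that are present in staging but not yet in the destination
INSERT INTO public.orders (order_id, amount, status)
SELECT s.order_id, s.amount, s.status
FROM   public.orders_stage s
WHERE  NOT EXISTS (
  SELECT 1 FROM public.orders t WHERE t.order_id = s.order_id
);

-- With CDC MERGE, DELETE events would additionally remove matching rows,
-- for example (cdc_op is a hypothetical operation-marker column):
-- DELETE FROM public.orders t USING public.orders_stage s
-- WHERE t.order_id = s.order_id AND s.cdc_op = 'D';
```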
| Other Features | Description |
| --- | --- |
| Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source. |
| Load CSV files | The Flow can load CSV files into Greenplum using the gpload utility. |
| Delete loaded source files | The Flow can be configured to delete successfully loaded source files. |
| Load data in parallel | The Flow is able to load data into multiple destination tables in parallel. |
| Process files in subfolders | The Flow is able to process files in multiple nested subfolders. |
Prerequisites
- The Greenplum instance must be accessible from the Internet.
- The gpload utility must be installed on the same VM as Etlworks. Contact Etlworks support at support@etlworks.com if you need assistance installing gpload.

Install and configure Greenplum gpload utility
Read how to install and configure the Greenplum gpload utility and the command used to execute gpload.
Process
Step 1. Create a new Server Storage Connection
This Connection will be used as a source to access the data files. Read more about the Server Storage connector.
Step 2. Create a new Greenplum Connection
This Greenplum Connection will be used as a destination.
Step 3. Create a new CSV Format or Parquet Format
This Format will be used to generate a control file for the Greenplum gpload utility.
Step 4. Create a new Flow
This Flow will be used to load files in server storage into Greenplum.
In Flows, click [+], type in bulk load files into greenplum, and select the Flow Bulk load files into Greenplum.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- The Server Storage Connection created in step 1.
- The CSV Format created in step 3.
- A wildcard filename that matches the filenames in the server storage.
- The Greenplum Connection created in step 2.
- The wildcard destination table name in the following format: schema.prefix_*_suffix, where schema is a Greenplum schema to load data into. Example: public.*. You don't need to use wildcards when loading into a specific table.
Step 6. Set optional and required parameters
Click MAPPING.
Select the Parameters tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- Include files in subfolders: if this option is enabled, the Flow will load all files that match the wildcard filename (set in FROM) in all subfolders under the root folder.
- Exclude files and Include files: the optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in FROM:
  - The Flow populates the list of files matching the wildcard filename in FROM.
  - The Flow excludes files based on the list of exclusions set in Exclude files.
  - The Flow includes files based on the list of inclusions set in Include files.
- Calculate Greenplum table name: this is an optional parameter used to calculate (using JavaScript) the destination table name based on the source file name. The original file name, without the extension, is stored in the variable name, and the actual table name must be assigned to the variable value. For example, let's assume that the source file name is db_schema-table_cdc_stream_uuid.csv. The JavaScript code to calculate the destination table name would be value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));. IMPORTANT: the Flow will automatically recognize and transform filenames matching the following template: *_cdc_stream_*.*. If filenames match this template, there is no need to provide a transformation script.
- Maximum Number of Files to Process: the optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there will be hundreds of thousands or millions (10^6 - 10^7) of files in the source folder(s), it is recommended to limit the number of files to process to a more manageable number: hundreds to thousands (10^3 - 10^5).
- Maximum Number of Parallel Loads: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel, one thread per table, but not more than the threshold set in this parameter.
- Parallel: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- Purge File if Success: if this option is enabled (it is by default), the source files will be automatically deleted after the data is successfully loaded.
Debug
- Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including the before/after LOAD SQL.
Error recovery
- Continue loading data into other tables if Error: if this parameter is enabled (it is disabled by default) and there is an error when loading data into some table, the Flow will continue loading data into other tables. If configured, a notification will be sent to the webhook.
Load parameters
- Action: can be INSERT, which inserts records from the file(s) into the Greenplum table; MERGE, which merges records from the file with the records in the Greenplum table; or CDC MERGE, which loads CDC events (INSERT/UPDATE/DELETE) in the same order as they originated in the source database. MERGE and CDC MERGE require configuring the Lookup Fields and/or enabling the option to predict the lookup fields (see below).
- Lookup Fields: the comma-separated list of fields that uniquely identify the record in the target Greenplum table.
- If the option to predict the lookup fields is enabled and Lookup Fields is empty, the system will try to predict the columns that uniquely identify the record in the target Greenplum table.
Hardcoded gpload parameters
- Content of the Control File: optional hardcoded gpload parameters. By default, the Flow will generate a control file automatically. The following tokens can be used as part of the control file:
  - {FILE}: the data file name
  - {TABLE}: the destination table name
  - {database}: the Greenplum database name
  - {host}: the Greenplum hostname
  - {port}: the Greenplum port number
  - {user}: the Greenplum user name
  - {password}: the Greenplum password
  - {COLUMNS}: the destination table columns
  - {MATCH_COLUMNS}: the lookup columns
  - {UPDATE_COLUMNS}: the columns to update
Parameters for automatically generated gpload control file
- Format: specifies the Format of the source data file(s): either plain text (TEXT) or comma-separated values (CSV). Defaults to TEXT if not specified. For more information about the Format of the source data, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
- String that represents NULL values: specifies the string that represents a NULL value. The default is backslash-N in TEXT mode and an empty value with no quotations in CSV mode. You might prefer an empty string even in TEXT mode for cases where you do not want to distinguish nulls from empty strings. Any source data item that matches this string will be considered a null value.
- Force NOT NULL: in CSV mode, processes each specified column as though it were quoted and hence not a NULL value. The default null string in CSV mode (nothing between two delimiters) causes missing values to be evaluated as zero-length strings.
- Encoding: character set encoding of the source data. Specify a string constant (such as SQL_ASCII), an integer encoding number, or DEFAULT to use the default client encoding. If not specified, the default client encoding is used. For information about supported character sets, see the Greenplum Database Reference Guide.
- Max # of Errors to Ignore: enables single-row error isolation mode for this load operation. When enabled, input rows that have Format errors will be discarded, provided that the error limit count is not reached on any Greenplum Database segment instance during input processing. If the error limit is not reached, all good rows will be loaded, and any error rows will either be discarded or captured as part of error log information. The default is to abort the load operation on the first error encountered. Note that single-row error isolation only applies to data rows with Format errors, for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors, such as primary key violations, will still cause the load operation to abort if encountered. For information about handling load errors, see 'Loading and Unloading Data' in the Greenplum Database Administrator Guide.
- Log Errors: the value is either true or false. The default value is false. If the value is true, rows with formatting errors are logged internally when running in single-row error isolation mode. You can examine formatting errors with the Greenplum Database built-in SQL function gp_read_error_log('table_name') (see the example query after this list). If formatting errors are detected when loading data, gpload generates a warning message with the name of the table that contains the error information.
- Truncate prior to loading: if set to true, gpload will remove all rows in the target table prior to loading it.
- Reuse Tables: if set to true, gpload will not drop the external table objects and staging table objects it creates. These objects will be reused for future load operations that use the same load specifications. This improves the performance of trickle loads (ongoing small loads to the same target table).
- Fully qualified domain name: specifies whether gpload resolves hostnames to the fully qualified domain name (FQDN) or the local hostname. If the value is set to true, names are resolved to the FQDN. If the value is set to false, the resolution is to the local hostname. The default is false. A fully qualified domain name might be required in some situations, for example, if the Greenplum Database system is in a different domain than an ETL application being accessed by gpload.
- SSL: specifies usage of SSL encryption. If SSL is set to true, gpload starts the gpfdist server with the --ssl option and uses the gpfdists:// protocol.
- SSL Certificates Path: required when SSL is true; cannot be specified when SSL is false or unspecified. The location specified in CERTIFICATES_PATH must contain the following files:
  - The server certificate file: server.crt
  - The server private key file: server.key
  - The trusted certificate authorities: root.crt
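As a sketch of how load errors can be examined, assume a hypothetical destination table public.orders was loaded with Log Errors set to true. The error-log functions are built into Greenplum, but their output columns can vary between Greenplum versions.

```sql
-- Inspect rows rejected during single-row error isolation for this table
SELECT * FROM gp_read_error_log('public.orders');

-- Optionally clear the error log for the table once the errors have been reviewed
SELECT gp_truncate_error_log('public.orders');
```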
ELT
- Set variables for Before and After SQL: code in JavaScript to set the variables that can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable vars.
  - Example: vars.put('schema', 'abc');
  - Example of the SQL: insert into test (schema_name) values ('{schema}')
- Before LOAD SQL: this SQL will be executed on the Greenplum Connection before loading data. You can configure the Flow to ignore errors when executing this SQL and to execute the SQL as a script.
- Ignore errors when executing Before LOAD SQL: if this option is enabled and there is an error when Before LOAD SQL is executed, the error will be ignored.
- Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
- After LOAD SQL: this SQL will be executed on the Greenplum Connection after loading data. You can configure the Flow to ignore errors when executing this SQL and to execute the SQL as a script. See the example after this list.
- Ignore errors when executing After LOAD SQL: if this option is enabled and there is an error when After LOAD SQL is executed, the error will be ignored.
- After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
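For instance, a hypothetical After LOAD SQL script (with After LOAD SQL is a script enabled) could refresh statistics on the loaded table and record the load in an audit table, using the {schema} variable set in the JavaScript example above. The table and column names here are illustrative only.

```sql
-- Refresh optimizer statistics on the freshly loaded table
ANALYZE {schema}.orders;

-- Record the load in an illustrative audit table
INSERT INTO {schema}.load_audit (table_name, loaded_at)
VALUES ('orders', now());
```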
Handling source schema changes
- Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has fields that the target table does not, the system will add the extra fields to the target table, as in the sketch below.
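For example, if the source files gain a column discount that a hypothetical destination table public.orders does not have, the statement issued would look similar to the following (sketch only; the exact DDL and data type mapping are determined by the Flow).

```sql
-- Add the column found in the source but missing from the destination table
ALTER TABLE public.orders ADD COLUMN discount numeric;
```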
Step 7. Optionally, add a mapping.
The mapping can be used to:
- Rename a column.
- Exclude a column.
- Change the column data type for tables automatically created by the Flow.
Configure mapping
Select the transformation and click MAPPING.
Add a mapping for the column CURRENT_DATE.
Renaming or excluding a column in the mapping will work for all tables that have this column. If a table does not have the column, the mapping will be ignored.
Step 8. Optionally, add more source-to-destination transformations
Click [+] and repeat steps 5 to 6 for more locations and/or file name templates.