When to use this Flow type
If you already have CSV files in Amazon S3 and don't need to transform the data, the most efficient way to load these files into Redshift is to use the Flow type `Bulk load files in S3 into Redshift` described in this article.
Use this Flow type:
- When you simply want to load CSV files that already exist in Amazon S3 into Redshift, without applying any transformations or transferring files over the network.
- When you want to load CSV files created by the CDC Flow.
Flows optimized for Redshift
| Flow type | When to use |
| --- | --- |
|  | When you need to extract data from any source, transform it, and load it into Redshift. |
| Bulk load files in S3 into Redshift (you are here) | When you need to bulk-load files that already exist in S3 without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination. |
| Stream CDC events into Redshift | When you need to stream updates from a database that supports Change Data Capture (CDC) into Redshift in real time. |
| Stream messages from queue into Redshift | When you need to stream messages from a message queue that supports streaming into Redshift in real time. |
| COPY files into Redshift | When you need to bulk-load data from file-based or cloud storage, an API, or a NoSQL database into Redshift without applying any transformations. This Flow requires providing a user-defined COPY command. Unlike Bulk load files in S3 into Redshift, this Flow does not support automatic MERGE. |
How it works
- The Flow reads the names of all files matching the wildcard in the specified S3 bucket. It can traverse subfolders as well.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads, one per destination table, but not more than the configurable threshold.
- The Flow generates and executes the `COPY` command to load files into a temporary table, one temporary table per actual destination table (see the sketch after this list).
- The Flow merges the data in the temporary tables with the data in the actual tables.
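To make the pattern concrete, here is a minimal sketch of the kind of SQL the Flow generates and runs behind the scenes. The table, bucket, file, and IAM role names are hypothetical, and the actual statements are produced automatically by the Flow:

```sql
-- Hypothetical example: stage one CSV file and merge it into public.orders
-- (all object names below are placeholders).

-- 1. Create a temporary table with the same structure as the destination table.
CREATE TEMP TABLE orders_temp (LIKE public.orders);

-- 2. Load the file from S3 into the temporary table.
COPY orders_temp
FROM 's3://my-bucket/data/orders_0001.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV;

-- 3. Merge: with the default DELETE/INSERT strategy, delete the rows that will be
--    replaced, then insert everything from the temporary table.
DELETE FROM public.orders
USING orders_temp
WHERE orders.order_id = orders_temp.order_id;

INSERT INTO public.orders
SELECT * FROM orders_temp;
```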
Features
The Flow supports `COPY` (INSERT), `MERGE` (UPSERT), and `CDC MERGE`. With `CDC MERGE`, the Flow updates the actual destination table by applying `INSERT`/`UPDATE`/`DELETE` events in the same order as they originated in the source database and were recorded in the CSV files, and it does so in the most efficient way possible.
| Other Features | Description |
| --- | --- |
| Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source. |
| Delete loaded source files | The Flow can be configured to delete successfully loaded source files. |
| Load data in parallel | The Flow is able to load data into multiple destination tables in parallel. |
| Process files in subfolders | The Flow is able to process files in multiple nested subfolders. |
| Create all or selected columns as TEXT | The Flow is able to create all or selected columns as TEXT, which mitigates issues related to source schema drift. |
Prerequisites
- Redshift is up and running and available from the Internet.
- The Redshift user has the `INSERT` privilege for the table(s) (see the example below).
- The Amazon S3 bucket is created, and Redshift is able to access the bucket.
- The files to load exist in the bucket.
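For example, here is a hedged sketch of granting the required privilege; the user and table names are hypothetical:

```sql
-- Hypothetical: grant the Redshift user used by the Flow the INSERT privilege
-- on the destination table.
GRANT INSERT ON public.orders TO etl_user;

-- If the Flow is also expected to CREATE or ALTER destination tables
-- (see "Monitor source schema changes"), it may need additional privileges,
-- for example CREATE on the target schema:
-- GRANT CREATE ON SCHEMA public TO etl_user;
```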
Process
Step 1. Create a new S3 Connection
This Connection will be used as a source to access files stored in Amazon S3.
Step 2. Create a new Redshift Connection
This Connection will be used as a destination.
Step 3. Create a new CSV Format
This Format will be used to generate a valid `COPY` command.
It is recommended to set `Encoding` to `UTF-8`. Keep the default values for the other parameters.
Step 4. Create a new Flow
This Flow will be used to load files in S3 into Redshift.
In `Flows`, click `[+]`, type in `bulk load files in S3`, and select the Flow `Bulk load files in S3 into Redshift`.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- The S3 Connection created in step 1.
- The CSV Format created in step 3.
- A wildcard filename that matches the filenames in the cloud storage. Use the `.csv` extension for uncompressed files and `.gz` for compressed files.
- The Redshift Connection created in step 2.
- The wildcard destination table name in the format `schema.*`, where `schema` is the Redshift schema to load data into, for example `public.*`. You don't need to use wildcards when loading into a specific table.
Step 6. Set optional and required parameters
Click `MAPPING`.
Select the `Parameters` tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- `Include files in subfolders`: if this option is enabled, the Flow will load all files that match the wildcard filename (set in `FROM`) in all subfolders under the main bucket or blob.
- `Exclude files` and `Include files`: optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in `FROM`:
  - The Flow populates the list of files matching the wildcard filename in `FROM`.
  - The Flow excludes files based on the list of exclusions set in `Exclude files`.
  - The Flow includes files based on the list of inclusions set in `Include files`.
- `Calculate Redshift table name`: an optional parameter used to calculate (using JavaScript) the destination (Redshift) table name based on the source file name. The original file name, without the extension, is stored in the variable `name`, and the actual table name must be assigned to the variable `value`. For example, if the source file name is `db_schema-table_cdc_stream_uuid.csv`, the JavaScript code to calculate the destination table name would be `value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));`. IMPORTANT: the Flow automatically recognizes and transforms filenames matching the template `*_cdc_stream_*.*`. If the filenames match this template, there is no need to provide a transformation script.
- `Maximum Number of Files to Process`: an optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there will be hundreds of thousands or millions (10^6-10^7) of files in the source folder(s), it is recommended to limit the number of files to process to a more manageable hundreds to thousands (10^3-10^5).
- `Maximum Number of Parallel Loads`: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel, one thread per table but not more than the threshold set in this parameter.
- `Parallel`: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- `Create new Redshift connection for each table`: if this parameter is enabled, a new Redshift connection will be created to load data into each destination table. The connections are closed when the Flow finishes loading data into all tables.
- `Purge File if Success`: if this option is enabled (it is by default), the source files will be automatically deleted after the `COPY` command is successfully executed.
Debug
- `Log each executed SQL statement`: enable this flag if you want to log each automatically generated and executed SQL statement, including the Before/After Load SQL.
Error recovery
- `Continue loading data into other tables if Error`: if this parameter is enabled (it is disabled by default) and there is an error when loading data into one of the tables, the Flow will continue loading data into the other tables. If configured, a notification will be sent to the webhook.
Load and Merge
- `IAM ROLE`: the optional AWS IAM role used to access the S3 bucket. If the role is not provided, the Flow will use the `Access Key` and `Secret` configured for the S3 Connection used in this transformation.
- `Action`: can be `INSERT`, which inserts records from the file(s) into the Redshift table, `MERGE`, which merges records from the file with the records in the Redshift table, or `CDC MERGE`, which loads CDC events (`INSERT`/`UPDATE`/`DELETE`) in the same order as they originated in the source database. `MERGE` and `CDC MERGE` require configuring `Lookup Fields` and/or enabling the option that predicts the lookup fields (see below).
- `How to MERGE`: the type of SQL to execute when merging data. The following options are available:
  - `Native MERGE` (preview): execute the native Redshift MERGE SQL. Note that the MERGE command is in preview.
  - `DELETE/INSERT` (default): DELETE all records in the actual table that also exist in the temp table, then INSERT all records from the temp table into the actual table.
- `Lookup Fields`: the comma-separated list of fields that uniquely identify the record in the target Redshift table. If the prediction option is enabled and `Lookup Fields` is empty, the system will try to predict the columns that uniquely identify the record in the target Redshift table.
- `Load into staging table`: by default, the Flow will attempt to create and load data into a temporary table. When this parameter is enabled (it is disabled by default), the Flow will create a staging table instead of a temporary one. It will automatically drop the staging table on exit.
- `User-defined MERGE SQL`: by default, the MERGE SQL is generated automatically. Use this parameter to override the automatically generated SQL (see the sketch after this list). The following {tokens} are available:
  - {TABLE}: the destination table.
  - {TEMP_TABLE}: the staging or temp table.
  - {FIELDS}: the fields to INSERT/UPDATE in the destination table.
  - {INSERT_FIELDS}: the values of the fields to INSERT into the destination table.
  - {UPDATE_FIELDS}: the fields and values to UPDATE in the format field=value,field=value.
  - {UPDATE_CONDITIONS}: the condition used to match records in the actual destination and temp/staging tables.
- `Truncate Columns`: if this option is enabled, string literals will be truncated during the load to the maximum allowed size for the designated column.
- `String used to convert to SQL NULL`: Redshift loads fields that match null_string as NULL, where null_string can be any string. If your data includes a null terminator, also referred to as NUL (UTF-8 0000) or binary zero (0x000), COPY treats it as any other character. For example, a record containing '1' || NUL || '2' is copied as a string of length 3 bytes. If a field contains only NUL, you can use NULL AS to replace the null terminator with NULL by specifying '\0' or '\000', for example NULL AS '\0' or NULL AS '\000'. If a field contains a string that ends with NUL and NULL AS is specified, the string is inserted with NUL at the end. Do not use '\n' (newline) for the null_string value. Amazon Redshift reserves '\n' for use as a line delimiter. The default null_string is '\N'.
- `Load empty chars as null`: indicates that Amazon Redshift should load empty `CHAR` and `VARCHAR` fields as `NULL`. Empty fields for other data types, such as INT, are always loaded with `NULL`. Empty fields occur when data contains two delimiters in succession with no characters between the delimiters.
- `Load blank chars as null`: loads blank fields, which consist of only white space characters, as `NULL`. This option applies only to `CHAR` and `VARCHAR` columns. Blank fields for other data types, such as INT, are always loaded with `NULL`. For example, a string that contains three space characters in succession (and no other characters) is loaded as `NULL`. The default behavior, without this option, is to load the space characters as is.
- `Accept any date`: allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error. This parameter applies only to `TIMESTAMP` and `DATE` columns.
- `Accept invalid UTF8 chars`: enables loading of data into `VARCHAR` columns even if the data contains invalid UTF-8 characters. When `ACCEPTINVCHARS` is specified, `COPY` replaces each invalid UTF-8 character with a string of equal length consisting of the character specified by replacement_char. The default replacement_char is a question mark (`?`).
- `Explicit IDs`: use `EXPLICIT_IDS` with tables that have `IDENTITY` columns if you want to override the autogenerated values with explicit values from the source data files. If the command includes a column list, that list must include the `IDENTITY` columns to use this parameter. The data format for `EXPLICIT_IDS` values must match the `IDENTITY` format specified by the `CREATE TABLE` definition.
- `Max # of Errors to Ignore`: the default value is `0` and the limit is 100000. Use this parameter to allow loads to continue when certain rows fail to load into the table because of formatting errors or other inconsistencies in the data. Set this value to `0` or `1` if you want the load to fail as soon as the first error occurs.
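Most of the load options above map to standard Redshift `COPY` parameters, and the merge step can be overridden through the {tokens} listed under `User-defined MERGE SQL`. The following is a hedged sketch; the table, bucket, file, and role names are hypothetical, and the COPY options shown are only a subset you might enable:

```sql
-- Hypothetical generated COPY into the temp/staging table.
COPY public.orders_temp
FROM 's3://my-bucket/data/orders_cdc_stream_0001.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
GZIP
TRUNCATECOLUMNS
EMPTYASNULL
BLANKSASNULL
ACCEPTANYDATE
ACCEPTINVCHARS '?'
MAXERROR 0;

-- Hypothetical User-defined MERGE SQL in the default DELETE/INSERT style.
-- The {tokens} are replaced by the Flow at run time.
DELETE FROM {TABLE}
USING {TEMP_TABLE}
WHERE {UPDATE_CONDITIONS};

INSERT INTO {TABLE} ({FIELDS})
SELECT {FIELDS}
FROM {TEMP_TABLE};
```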
ELT
- `Set variables for Before and After SQL`: JavaScript code to set variables that can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable `vars`.
  - Example: `vars.put('schema', 'abc');`
  - Example of the SQL: `insert into test (schema_name) values ('{schema}')`
- `Before Load SQL`: this SQL will be executed on the Redshift Connection before running the `COPY SQL` (see the example after this list).
- `Ignore errors when executing Before Load SQL`: if this option is enabled and there is an error when `Before Load SQL` is executed, the error will be ignored.
- `Before Load SQL is a script`: if this option is enabled, `Before Load SQL` will be executed as a Redshift script.
- `After Load SQL`: this SQL will be executed on the Redshift Connection after running the `COPY SQL`.
- `Ignore errors when executing After Load SQL`: if this option is enabled and there is an error when `After Load SQL` is executed, the error will be ignored.
- `After Load SQL is a script`: if this option is enabled, `After Load SQL` will be executed as a Redshift script.
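As an illustration, a Before Load SQL / After Load SQL pair might look like the following. The audit table is hypothetical, and the `{schema}` token assumes a variable set via `vars.put('schema', ...)` as shown above:

```sql
-- Hypothetical Before Load SQL: record the start of the load in an audit table.
INSERT INTO etl_audit (schema_name, started_at)
VALUES ('{schema}', GETDATE());

-- Hypothetical After Load SQL: refresh statistics on the destination table
-- after the merge completes.
ANALYZE public.orders;
```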
Handling source schema changes
- `Alter target table if the source has columns that the target table doesn't have`: if this option is enabled (it is disabled by default) and the source has fields that the target (Redshift) table does not, the system will add the extra fields to the target table.
- `Create all tables with only TEXT columns`: if this option is enabled (it is disabled by default), the Flow will create all tables with only TEXT columns. Enable this option to make the load more resilient to source schema drift. By default, the Flow creates the actual table using the first file as a pattern. It then alters the actual table by adding new columns, but it never modifies the data types of the existing columns, which could cause issues if there are multiple source files to load into the same destination table and some of these files have columns with different data types (compared to each other).
- `Tables with all TEXT columns`: a comma-separated list of regular expressions matching the table names for which the Flow should create all columns as `TEXT`. The rules for the regexes are:
  - Regexes are case-insensitive.
  - If a regex includes a comma (`,`), it must be escaped (`/,`).
  - Java regex syntax must be used.
  - If the destination table name is configured using fully qualified (DB.SCHEMA.TABLE) or partially qualified (SCHEMA.TABLE) notation, the regex should reflect it. Example (include tables `country` and `city`): `text.*` -> `.*country,.*city`.
- `TEXT columns`: a comma-separated list of regular expressions matching the column names which should be created as `TEXT`.

The rules for the three parameters above are:

- If `Create all tables with only TEXT columns` is enabled, all columns in all new tables and new columns in the existing tables will be created as `TEXT`.
- If the table name matches a regex in `Tables with all TEXT columns` and `TEXT columns` is empty, all columns in that table will be created as `TEXT`.
- If `TEXT columns` is not empty and the column name in any table matches any regex configured in this field, the column will be created as `TEXT`.
Step 7. Optionally, add a mapping.
The mapping can be used to:
- Rename the column.
- Exclude the column.
- Change the column data type for tables automatically created by the flow.
Configure mapping
Select the transformation and click `MAPPING`.
Add a mapping for the column `CURRENT_DATE`.
Renaming or excluding a column in the mapping will work for all tables that have this column. If a table does not have the column, the mapping will be ignored.
Step 8. Optionally, add more source-to-destination transformations
Click `[+]` and repeat steps 5 to 6 for more locations and/or file name templates.