Using Redshift-specific flows you can extract data from any of the supported sources and load it directly into Redshift.
A typical Redshift flow performs the following operations:
- Extracts data from the source.
- Creates CSV files.
- Compresses files using the gzip algorithm.
- Copies the files into an Amazon S3 bucket.
- Checks whether the destination Redshift table exists and, if it does not, creates the table using metadata from the source.
- Dynamically generates and executes the Redshift COPY command.
- Cleans up the remaining files, if needed.
The flow requires the following prerequisites:
- The Redshift cluster is up and running and accessible from the Internet.
- The Redshift user has the INSERT privilege for the table(s).
- The Amazon S3 bucket is created and Redshift is able to access the bucket.
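The extract / CSV / gzip portion of the pipeline above can be sketched in a few lines of Python. The rows and field names here are hypothetical, and the S3 upload, table creation, and COPY steps are omitted because they require live connections:

```python
import csv
import gzip
import io

# Hypothetical source rows; in a real flow these come from the source connection.
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": None},  # a NULL value in the source
]

# Extract the data and create a CSV document, writing \N for NULLs so the
# Redshift COPY command can tell NULL apart from an empty string.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "name"])
for r in rows:
    writer.writerow([r["id"], r["name"] if r["name"] is not None else r"\N"])

# Compress the document with gzip before it is staged in the S3 bucket.
compressed = gzip.compress(buf.getvalue().encode("utf-8"))

# Uploading to S3, creating the table if needed, and running COPY would use
# the configured connections and are omitted here.
assert gzip.decompress(compressed).decode("utf-8") == buf.getvalue()
```

The actual flow performs all of these steps automatically; the sketch only illustrates the order of operations.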
Step 1. Create all required connections.
When configuring a connection for Amazon S3, which will be used as a stage for the Redshift flows, it is recommended that you select GZip as the value for the Archive file before copying field:
Step 2. Create a data exchange format
When configuring the CSV format, it is recommended to set the Value for null field to \N, so the Redshift COPY command can differentiate between an empty string and a NULL value.
If you are using the CSV format for loading large datasets into Redshift, consider configuring the format to split the document into smaller files: Redshift can load files in parallel, and transferring smaller files over the network can be faster.
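To illustrate the splitting idea, here is a minimal sketch; the document contents and the chunk size are made up, and in practice the splitting is done by the format configuration, not by user code:

```python
def split_csv(text, max_rows):
    """Split one CSV document into several smaller documents of at most
    max_rows data rows each, repeating the header in every part, so that
    Redshift can load the parts in parallel."""
    lines = text.splitlines()
    header, data = lines[0], lines[1:]
    parts = []
    for i in range(0, len(data), max_rows):
        parts.append("\n".join([header] + data[i:i + max_rows]) + "\n")
    return parts

# A hypothetical document with 5 data rows, split into parts of at most 2 rows.
doc = "id,name\n" + "".join(f"{i},row{i}\n" for i in range(5))
parts = split_csv(doc, 2)
print(len(parts))  # 3 parts: 2 + 2 + 1 rows
```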
Step 3. Create a flow to load data in Redshift
Start creating Redshift flows by opening the Flows window, clicking the + button, and typing redshift into the search field:
Continue by selecting the flow type, adding source-to-destination transformations and entering the transformation parameters:
Step 4. Set Redshift Connection
For all Redshift flows, the final destination is Redshift.
To configure the final destination, click the Connections tab and select the available Redshift connection.
Step 5. Set the source and the destination
Depending upon the flow type, you can select one of the following sources (FROM) for the Redshift flow:
- Database - use the table name as the source (FROM) name
- CDC - use the fully qualified table name as the source (FROM) name
- API - use any appropriate string as the source (FROM) name
- Web Service - use any appropriate string as the source (FROM) name
- File - use the source file name or a wildcard filename as the source (FROM) name
- Queue - use the queue topic name as the source (FROM) name
For all Redshift flows, the destination connection is going to be an Amazon S3 connection. Integrator uses the destination connection as a Redshift stage.
Select or enter the fully-qualified Redshift table name as a destination (TO).
Step 6. Set Data Exchange Format configured in step 2
Step 7. Set the optional parameters
Click the MAPPING button in the transformation row and select the Parameters tab.
Configure optional parameters and transformations:
- Redshift Table Name - the name of the Redshift table to load data into. This field is optional and overrides the Redshift table name set at the transformation level.
- Column List - you can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY statement, but when loading from flat files, such as in an Amazon S3 bucket, their order must match the order of the source data.
- Truncate Columns - if this option is enabled, the string literals will be truncated during the load to the maximum allowed size for the designated column.
- Load empty chars as null - Indicates that Amazon Redshift should load empty CHAR and VARCHAR fields as NULL. Empty fields for other data types, such as INT, are always loaded with NULL. Empty fields occur when data contains two delimiters in succession with no characters between the delimiters.
- Load blank chars as null - Loads blank fields, which consist of only white space characters, as NULL. This option applies only to CHAR and VARCHAR columns. Blank fields for other data types, such as INT, are always loaded with NULL. For example, a string that contains three space characters in succession (and no other characters) is loaded as a NULL. The default behavior, without this option, is to load the space characters as is.
- Accept any date - Allows any date format, including invalid formats such as 00/00/00 00:00:00, to be loaded without generating an error. This parameter applies only to TIMESTAMP and DATE columns.
- Accept invalid UTF8 chars - Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters. When ACCEPTINVCHARS is specified, COPY replaces each invalid UTF-8 character with a string of equal length consisting of the character specified by replacement_char. The default replacement_char is a question mark ( ? ).
- Escape - When this parameter is specified, the backslash character in input data is treated as an escape character. The character that immediately follows the backslash character is loaded into the table as part of the current column value, even if it is a character that normally serves a special purpose. For example, you can use this parameter to escape the delimiter character, a quotation mark, an embedded newline character, or the escape character itself when any of these characters is a legitimate part of a column value.
- Remove quotes - Removes surrounding quotation marks from strings in the incoming data. All characters within the quotation marks, including delimiters, are retained.
- Explicit IDs - Use EXPLICIT_IDS with tables that have IDENTITY columns if you want to override the autogenerated values with explicit values from the source data files for the tables. If the command includes a column list, that list must include the IDENTITY columns to use this parameter. The data format for EXPLICIT_IDS values must match the IDENTITY format specified by the CREATE TABLE definition.
- File Format - the CSV format definition. Read about Redshift file formats. The format definition is optional. If not provided - the system will automatically create an ad-hoc definition based on the destination format configured for the transformation.
- IAM Role - the optional AWS IAM role used to access the S3 bucket. If the role is not provided the flow will use the Access Key and Secret configured for the S3 connection used in this transformation.
- Max # of Errors to Ignore - The default value is 0 and the limit is 100000. Use this parameter to allow loads to continue when certain rows fail to load into the table because of formatting errors or other inconsistencies in the data. Set this value to 0 or 1 if you want the load to fail as soon as the first error occurs.
- Refresh of Optimizer Statistics - Governs automatic computation and refresh of optimizer statistics at the end of a successful COPY command. By default, if the STATUPDATE parameter is not used, statistics are updated automatically if the table is initially empty. With STATUPDATE ON, statistics are updated automatically regardless of whether the table is initially empty. With STATUPDATE OFF, statistics are never updated.
- No Load - Checks the validity of the data file without actually loading the data. Use the NOLOAD parameter to make sure that your data file will load without any errors before running the actual data load. Running COPY with the NOLOAD parameter is much faster than loading the data because it only parses the files.
- Action - can be either COPY INTO, which inserts records from the file(s) into the Redshift table, or MERGE, which merges records from the file with the records in the Redshift table. MERGE requires configuring Lookup Fields.
- Lookup Fields - the comma-separated list of fields which uniquely identify the record in the Redshift table. This field is required if action is set to MERGE.
- Copy SQL - this is a user-defined COPY SQL. By default, Integrator creates COPY SQL automatically, based on the input and output parameters. You can override it by using this field. Read more about the Redshift COPY command.
- Purge File if Success - if this option is enabled (by default), and the user-defined SQL is not used, Integrator will run the COPY INTO command with the purge option enabled. With purge enabled, the staging file will be automatically deleted after the COPY INTO command is successfully executed.
- Purge File(s) if Error - if this option is enabled (by default) the staging files will be automatically deleted if a COPY INTO command was executed with an error.
- Name(s) of the File(s) to Purge - if "Purge File(s) if Error" is enabled, it is possible to configure which files you want the system to delete if an error occurs while executing the COPY command. Wildcard file names are allowed. If not specified, the system will use the same wildcard file name as it used for the COPY command.
- Before COPY SQL - this SQL will be executed on the Redshift connection before running the COPY SQL.
- Ignore errors when executing Before COPY SQL - if this option is enabled, any error that occurs while executing the Before COPY SQL will be ignored.
- After COPY SQL - this SQL will be executed on the Redshift connection after running the COPY SQL.
- Ignore errors when executing After COPY SQL - if this option is enabled, any error that occurs while executing the After COPY SQL will be ignored.
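To make the parameters above more concrete, here is a rough sketch of how a COPY command could be assembled from them. The table name, bucket path, and role ARN are hypothetical; the option spellings (GZIP, CSV, TRUNCATECOLUMNS) follow the Redshift COPY syntax, and in the product the command is generated by Integrator, not by user code:

```python
def build_copy_sql(table, s3_path, iam_role=None, access_key=None, secret_key=None,
                   options=("GZIP", "CSV", "TRUNCATECOLUMNS")):
    """Assemble a Redshift COPY command from a staging location and options.

    If an IAM role is provided it is used for authorization; otherwise the
    access key / secret pair from the S3 connection would be used.
    """
    if iam_role:
        creds = f"IAM_ROLE '{iam_role}'"
    else:
        creds = (f"CREDENTIALS 'aws_access_key_id={access_key};"
                 f"aws_secret_access_key={secret_key}'")
    return f"COPY {table} FROM '{s3_path}' {creds} " + " ".join(options)

# Hypothetical example values.
copy_sql = build_copy_sql("public.orders", "s3://my-bucket/stage/orders.csv.gz",
                          iam_role="arn:aws:iam::123456789012:role/redshift-copy")
print(copy_sql)
```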
Depending on the flow type, other flow parameters can be added, as explained below:
Step 8. Configure MERGE (UPSERT) if needed
To merge (upsert) existing data in the Redshift table with new data:
- Set the Action to MERGE.
- Define the Lookup Fields - the comma-separated list of fields which uniquely identify the record in the Redshift table.
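A merge (upsert) against Redshift is commonly implemented as a delete-then-insert from a staging table, keyed on the Lookup Fields. A minimal sketch of how such SQL could be generated, with hypothetical table and column names:

```python
def build_merge_sql(target, staging, columns, lookup_fields):
    """Generate a Redshift-style merge: delete the target rows that match the
    staging rows on the lookup fields, then insert everything from staging."""
    on = " AND ".join(f"{target}.{f} = {staging}.{f}" for f in lookup_fields)
    cols = ", ".join(columns)
    return (
        f"DELETE FROM {target} USING {staging} WHERE {on}; "
        f"INSERT INTO {target} ({cols}) SELECT {cols} FROM {staging};"
    )

# Hypothetical example: merge staged orders into the target table on "id".
merge_sql = build_merge_sql("public.orders", "public.orders_stage",
                            ["id", "amount"], ["id"])
print(merge_sql)
```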
Step 9. Configure how flow handles cases when source and destination have a different number of columns
It is quite typical for the source (for example, a table in an OLTP database) and the destination (Redshift table) to have a different number of columns. If that is the case, by default the flow will fail, since the Redshift COPY command cannot load files which have more or fewer columns than the target table.
When setting the source-to-destination transformation it is possible to configure it to automatically handle both edge cases:
- Adjust if Source has More Fields - if this option is enabled (it is disabled by default) and the source has more fields than the target (Redshift) table, the system will automatically exclude the extra fields.
- Adjust if Destination has More Fields - if this option is enabled (it is disabled by default) and the target (Redshift) table has more fields than the source the system will automatically add empty fields.
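The two adjustments can be pictured as a per-row column alignment. A minimal sketch with hypothetical field names (the flow performs this automatically; no user code is needed):

```python
def adjust_row(row, source_fields, dest_fields):
    """Align a source row to the destination columns: fields the destination
    does not have are dropped, and fields the source does not have are
    filled with empty values."""
    by_name = dict(zip(source_fields, row))
    return [by_name.get(f, "") for f in dest_fields]

# The source has an extra field ("legacy"); the destination has an extra
# field ("notes").
row = adjust_row(["1", "alice", "x"],
                 ["id", "name", "legacy"],
                 ["id", "name", "notes"])
print(row)  # ['1', 'alice', '']
```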
It is possible to send an email notification if the source and destination have a different number of columns. To configure the flow to send a notification when either the source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
Step 10. Configure Mapping if needed
If necessary, you can create a mapping between the source and destination (Redshift) fields.
You create mapping between the source and destination just like you usually do for any other flow type.
Mapping is not required, but please remember that if a source field name is not supported by Redshift, Redshift will return an error and the data will not be loaded into the database. For example, if you are loading data from Google Analytics, the output (source) will include fields with the prefix ga: (ga:user, ga:browser, etc.). Unfortunately, Redshift does not support fields containing a colon (:), so the data will be rejected. If that happens, you can use mapping to rename the destination fields:
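As a sketch of the renaming idea, the mapping could replace the unsupported colon with an underscore; the ga_ target names here are just an example of a valid choice, not a required convention:

```python
def rename_for_redshift(fields):
    """Turn source field names that Redshift rejects (such as the Google
    Analytics ga: prefix) into valid column names by replacing the colon."""
    return [f.replace("ga:", "ga_") if f.startswith("ga:") else f
            for f in fields]

print(rename_for_redshift(["ga:user", "ga:browser", "id"]))
# ['ga_user', 'ga_browser', 'id']
```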
Step 11. Set up Change Replication using high watermark if needed
As in any other flow type, it is possible to configure a change replication using high watermark.
When change replication is enabled, only the changed records will be loaded into Redshift.
Basically, all you need to do is set the high watermark field and enable change replication for the transformation. Additionally, for better reliability, you can set the calculated high watermark field value.
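Conceptually, high watermark replication extracts only the rows whose watermark column exceeds the last recorded value. A minimal sketch of such an extract query, with hypothetical table and column names (the flow builds and tracks this automatically):

```python
def build_hwm_query(table, hwm_field, last_hwm):
    """Generate an extract query that selects only the rows changed since the
    last recorded high watermark value."""
    return (f"SELECT * FROM {table} "
            f"WHERE {hwm_field} > '{last_hwm}' "
            f"ORDER BY {hwm_field}")

# Hypothetical example: only orders updated after the stored watermark.
hwm_query = build_hwm_query("orders", "updated_at", "2024-01-01 00:00:00")
print(hwm_query)
```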
Step 12. Set up change replication using CDC if needed
Change data capture (CDC) is an approach to data integration that is based on the identification, capture, and delivery of the changes made to the source database and stored in the database redo log (also called transaction log).
Etlworks supports replicating data using CDC from MySQL, SQL Server, Oracle, and PostgreSQL.
Read about configuring CDC for the source databases:
Once CDC is configured for the source database, you can create a CDC flow where the source is one of these databases and the destination is Redshift.
Step 1. Create a source CDC connection for the source database. Enable the Always generate INSERT option.
Step 2. Create a connection to Redshift. Disable auto-commit.
Step 3. Create a new flow by selecting any of the CDC flows from the gallery.
Step 4. All parameters are the same as in the regular Redshift flow, except the Action, which must be set to CDC MERGE (setting Lookup Fields is also required):
Common issues when loading data in Redshift
Read how to troubleshoot and fix common issues when loading data in Amazon Redshift.