When to use this Flow type
This Flow bulk-loads data into Redshift using the user-defined COPY command. The data can be files in file-based or cloud storage, responses from APIs, email attachments, or objects stored in a NoSQL database.
Flows optimized for Redshift
| Flow type | When to use |
| --- | --- |
|  | When you need to extract data from any source, transform it, and load it into Redshift. |
| Bulk load files in S3 into Redshift | When you need to bulk-load files that already exist in S3 without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination. |
| Stream CDC events into Redshift | When you need to stream updates from a database that supports Change Data Capture (CDC) into Redshift in real time. |
| Stream messages from queue into Redshift | When you need to stream messages from a message queue that supports streaming into Redshift in real time. |
| COPY files into Redshift (you are here) | When you need to bulk-load data from file-based or cloud storage, an API, or a NoSQL database into Redshift without applying any transformations. This Flow requires a user-defined COPY command. Unlike Bulk load files in S3 into Redshift, this Flow does not support automatic MERGE. |
How it works
This Flow copies files from the source Connection to the S3 bucket and then executes the user-defined COPY command.
Prerequisites
- The Redshift cluster is up and running and accessible from the Internet.
- The Redshift user has the `INSERT` privilege on the target table(s) (a sample grant statement is shown after this list).
- The Amazon S3 bucket is created, and Redshift is able to access the bucket.
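The privilege requirement can be covered with a standard Redshift GRANT statement. This is only an illustration; the schema, table, and user names below are placeholders.

```sql
-- Placeholder names: replace public.orders and etl_user with your own
GRANT INSERT ON public.orders TO etl_user;
```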
Step-by-step process
Step 1. Create a source Connection.
Step 2. Create and test the Redshift Connection as explained here.
Step 3. Create an S3 Connection. When configuring the S3 Connection, select `GZip` as the value for the `Archive file before copying to S3` field.
Step 4. In Flows, click the Add flow button and enter `Redshift` in `Select Flow Type`.
Step 5. Select `Copy files into Redshift`.
Step 6. Continue by defining the source and destination.
Step 7. For the source `FROM`, select the Connection created in Step 1 and enter a file name or a wildcard file name, for example, `*.csv`.
Step 8. For the destination `TO`, select the Connection created in Step 3 and enter the base destination table name (without the database name and schema name).
Step 9. Select the `Connections` tab and select the Redshift Connection created in Step 2.
Step 10. Click `MAPPING`, select `Parameters`, and continue by defining the parameters. The only required parameter is `Copy SQL`.
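The exact statement depends on your data and environment; the following is only a minimal sketch of what a `Copy SQL` value might look like, assuming a gzipped CSV file staged by the Flow. The table name, bucket, file name, and IAM role are placeholders, and any Etlworks-specific substitution tokens for the staged file name are not shown.

```sql
-- Minimal sketch of a user-defined COPY command (all names are placeholders)
COPY public.orders_staging
FROM 's3://my-staging-bucket/orders.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS CSV
IGNOREHEADER 1
GZIP;
```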
Step 11. Define the other, optional parameters:
- Action: if `Move` is selected (the default), the file(s) will be copied to the S3 bucket and removed from the original source. If `Copy` is selected, the file(s) will be copied to the S3 bucket and retained in the original source.
- Add Suffix to the Destination File Name: you can select one of the predefined suffixes for the files created by this file operation. For example, if you select `uuid` as a suffix and the original file name is `dest.csv`, Etlworks will create files with the name `dest_uuid.csv`, where `uuid` is a globally unique identifier such as `21EC2020-3AEA-4069-A2DD-08002B30309D`. The default value for this field is `uuid`.
- Do not process files that have been already processed: if this option is enabled, the system will skip files that have already been loaded into Redshift.
- Maximum Simultaneous Operations: Etlworks can copy or move each file in its own thread. Use this property to set the maximum number of simultaneous file operations.
- Purge File(s) if Error: if this option is enabled (it is by default), the staging files will be automatically deleted if the `COPY` command finishes with an error.
- Name(s) of the File(s) to Purge: with `Purge File(s) if Error` enabled, this property configures which files the system should delete if an error occurs while executing the `COPY` command. Wildcard file names are allowed. If not specified, the system uses the same wildcard file name it used for the `COPY` command.
- Before COPY INTO SQL: this SQL will be executed on the Redshift Connection before the `COPY` command (see the sketch after this list).
- Ignore errors when executing Before COPY SQL: if this option is enabled and there is an error when `Before COPY SQL` is executed, the error will be ignored.
- After COPY SQL: this SQL will be executed on the Redshift Connection after the `COPY` command (see the sketch after this list).
- Ignore errors when executing After COPY SQL: if this option is enabled and there is an error when `After COPY SQL` is executed, the error will be ignored.
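As a rough illustration of how `Before COPY INTO SQL` and `After COPY SQL` can be used together, the sketch below clears a staging table before the load and then merges the staged rows into a target table afterwards. The table and column names are hypothetical, and whether you need a staging-plus-merge pattern at all depends on your use case.

```sql
-- Before COPY INTO SQL (hypothetical): start from an empty staging table
TRUNCATE TABLE public.orders_staging;

-- After COPY SQL (hypothetical): upsert the staged rows into the target table
BEGIN;
DELETE FROM public.orders
USING public.orders_staging
WHERE public.orders.order_id = public.orders_staging.order_id;
INSERT INTO public.orders SELECT * FROM public.orders_staging;
COMMIT;
```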
Step 12. Save and run the Flow.