About S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
The Etlworks S3 connector supports reading, creating, updating, renaming, and deleting objects in S3.
When to use this connector
- When working with files where the source or destination is S3.
- When loading data into Snowflake or Redshift.
- When streaming CDC events to cloud storage.
Creating a connection
Step 1. In the Connections window, click +, then type in s3.
Step 2. Select Amazon S3 (SDK).
Step 3. Enter Connection parameters.
Connection parameters
Common parameters
- AWS Region: the AWS region. This parameter is only available for the SDK connector.
- Bucket: the bucket name.
- Directory: the directory under the bucket. This parameter is optional.
- Files: the actual file name or a wildcard file name, for example, *.csv (see the sketch after this list).
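The Region, Bucket, Directory, and Files values map naturally onto an S3 listing call: the directory becomes a key prefix, and the wildcard is applied to object names. The sketch below (AWS SDK for Java v2) is only an illustration of that mapping; the region, bucket, prefix, and pattern are placeholders, not Etlworks defaults.

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class ListMatchingFiles {
    public static void main(String[] args) {
        // AWS Region + Bucket + Directory + Files (*.csv) expressed as a listing call.
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            ListObjectsV2Request request = ListObjectsV2Request.builder()
                    .bucket("my-bucket")   // Bucket
                    .prefix("inbound/")    // Directory
                    .build();

            // The wildcard (*.csv) is applied client-side to the returned keys.
            s3.listObjectsV2Paginator(request).contents().stream()
                    .filter(o -> o.key().endsWith(".csv"))
                    .forEach(o -> System.out.println(o.key() + " (" + o.size() + " bytes)"));
        }
    }
}
```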
Authentication
The Etlworks S3 connector supports authentication with an access key/secret key pair or with an IAM role.
- Access Key or IAM Role: the AWS access key or the IAM role name.
- Secret Access Key: the AWS secret access key. Note: the secret access key must be empty when authenticating with an IAM role.
- External ID: in abstract terms, the external ID allows the user that is assuming the role to assert the circumstances in which they are operating. It also provides a way for the account owner to permit the role to be assumed only under specific circumstances. The primary function of the external ID is to address and prevent the confused deputy problem.
If both authentication parameters are empty, the connector attempts to authenticate using the default profile configured for the EC2 instance running Etlworks.
NOTE: If you are intermittently getting the error "Failed to load credentials from IMDS", consider configuring the auto-retry with at least 5 retry attempts.
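For reference, these three authentication modes correspond to standard AWS SDK credential providers. The sketch below (AWS SDK for Java v2) illustrates the underlying mechanisms only, not the connector's actual implementation; the region, role ARN, and session name are placeholders.

```java
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.sts.StsClient;
import software.amazon.awssdk.services.sts.auth.StsAssumeRoleCredentialsProvider;
import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;

public class S3AuthSketch {

    // 1. Access key + secret key: static credentials.
    static S3Client withAccessKey(String accessKey, String secretKey) {
        return S3Client.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create(accessKey, secretKey)))
                .build();
    }

    // 2. IAM role (with optional external ID): credentials obtained via STS AssumeRole.
    static S3Client withAssumedRole(String roleArn, String externalId) {
        StsClient sts = StsClient.builder().region(Region.US_EAST_1).build();
        AssumeRoleRequest request = AssumeRoleRequest.builder()
                .roleArn(roleArn)
                .roleSessionName("etl-session") // hypothetical session name
                .externalId(externalId)         // mitigates the confused deputy problem
                .build();
        return S3Client.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(StsAssumeRoleCredentialsProvider.builder()
                        .stsClient(sts)
                        .refreshRequest(request)
                        .build())
                .build();
    }

    // 3. Both parameters empty: fall back to the default provider chain,
    //    which includes the EC2 instance profile (IMDS).
    static S3Client withDefaultChain() {
        return S3Client.builder()
                .region(Region.US_EAST_1)
                .credentialsProvider(DefaultCredentialsProvider.create())
                .build();
    }
}
```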
Other parameters
- Metadata: S3 object metadata. You can set object metadata in Amazon S3 at the time you upload the object. Object metadata is a set of name-value pairs. After you upload the object, you cannot modify its metadata; the only way to modify object metadata is to make a copy of the object and set the metadata on the copy. Metadata is ignored when multipart upload is configured (see the sketch after this list).
- Download chunk size (bytes): the size of each data chunk, in bytes, that will be downloaded from the server during a file transfer. Using chunked downloads can improve reliability and performance, especially when working with large files: if a download is interrupted, only the current chunk needs to be retried rather than starting from the beginning. Adjust this value based on your network speed and the size of the files being downloaded.
- Part Size For Multipart Upload (bytes): entering a number greater than 5242880 enables multipart upload to S3. If nothing is entered (default), multipart upload is disabled. The minimum part size is 5242880 bytes; the maximum is 5368709120 bytes. Multipart uploads send an object's data in parts instead of all at once, which provides the following advantages:
  - objects larger than 5 GB can be stored;
  - large files can be uploaded in smaller pieces to reduce the impact of transient upload/networking errors;
  - objects can be constructed from data that is uploaded over a period of time, when it may not all be available in advance.
- Add Suffix When Creating Files In Transformation: you can select one of the predefined suffixes for the files created using this Connection. For example, if you select uuid as a suffix and the original file name is dest.csv, Etlworks will create files with the name dest_uuid.csv, where uuid is a globally unique identifier such as 21EC2020-3AEA-4069-A2DD-08002B30309D. This parameter works only when the file is created using a source-to-destination transformation. Read how to add a suffix to the files created when copying, moving, renaming, and zipping files.
- File Processing Order: specifies the order in which source files are processed when using wildcard patterns in ETL and file-based flows (e.g., copy, move, delete). The default setting is Oldest, meaning files are processed starting with the oldest by creation or modification time. Choose from various criteria such as file age, size, or name to determine the processing sequence:
  - Disabled: wildcard processing is disabled;
  - Oldest/Newest: process files based on their creation or modification time;
  - Ascending/Descending: process files in alphabetical order;
  - Largest/Smallest: process files based on their size.
- Archive file before copying to: Etlworks can archive files, using one of the supported algorithms (zip or gzip), before copying them to cloud storage. Since cloud storage is typically a paid service, archiving files can save money and time.
- Contains CDC events: when this parameter is enabled, Etlworks adds standard wildcard templates for CDC files to the list of available sources in the FROM selector.
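As a reference for what the Metadata parameter translates to at the API level, the following sketch (AWS SDK for Java v2) attaches user-defined metadata while uploading an object. The bucket, key, file path, and metadata values are placeholders for illustration, and the client relies on the default region and credentials configuration.

```java
import java.nio.file.Paths;
import java.util.Map;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class PutWithMetadata {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            PutObjectRequest request = PutObjectRequest.builder()
                    .bucket("my-bucket")
                    .key("inbound/dest.csv")
                    // User-defined metadata is sent as name-value pairs at upload time
                    // and cannot be changed later without copying the object.
                    .metadata(Map.of(
                            "source-system", "etlworks",
                            "load-date", "2024-01-01"))
                    .build();
            s3.putObject(request, RequestBody.fromFile(Paths.get("/tmp/dest.csv")));
        }
    }
}
```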
Auto-retry
To configure auto-retry for each individual request to the AWS API, set the following parameters (a sketch of an equivalent SDK retry policy follows the list):
- Number of Retries: the maximum number of times that a single request should be retried, assuming it fails with a retryable error.
- Initial wait time (ms): the initial wait time in milliseconds before making the first retry attempt. This delay increases exponentially with each subsequent retry, often combined with jitter to avoid collisions from simultaneous retries. The default is 500 milliseconds.
- Maximum delay (seconds): the upper limit on the wait time between retries. Without a maximum limit, the wait time can become excessively long, especially after multiple retries, which can lead to significant delays in processing. The default is 10 seconds.
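For context, these three parameters mirror a standard exponential-backoff-with-jitter retry policy. A minimal sketch in AWS SDK for Java v2, assuming 5 retries, a 500 ms base delay, and a 10-second cap (the connector's internal wiring may differ):

```java
import java.time.Duration;

import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.core.retry.RetryPolicy;
import software.amazon.awssdk.core.retry.backoff.FullJitterBackoffStrategy;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class RetryConfigSketch {
    public static void main(String[] args) {
        // Exponential backoff with jitter: base delay doubles per attempt, capped at 10 s.
        FullJitterBackoffStrategy backoff = FullJitterBackoffStrategy.builder()
                .baseDelay(Duration.ofMillis(500))      // "Initial wait time (ms)"
                .maxBackoffTime(Duration.ofSeconds(10)) // "Maximum delay (seconds)"
                .build();

        RetryPolicy retryPolicy = RetryPolicy.builder()
                .numRetries(5)                          // "Number of Retries"
                .backoffStrategy(backoff)
                .build();

        try (S3Client s3 = S3Client.builder()
                .region(Region.US_EAST_1)               // placeholder region
                .overrideConfiguration(ClientOverrideConfiguration.builder()
                        .retryPolicy(retryPolicy)
                        .build())
                .build()) {
            // Every request made through this client now retries on retryable errors
            // (throttling, transient network failures, credential/IMDS hiccups, etc.).
            s3.listBuckets();
        }
    }
}
```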
Chunked download and upload
When working with large files in Amazon S3, downloading or uploading the entire file in one go can be inefficient and prone to errors, especially for applications with limited memory or unstable network connections. A common approach to handling this is by using chunked downloads and uploads, which breaks the file into smaller parts. Here are the key advantages of using chunked transfers for both download and upload in an S3 connection:
- Improved Memory Efficiency
- Increased Reliability and Resilience
- Parallel Processing for Faster Transfers
- Resume Interrupted Transfers
- Lower Latency for Large Files
- Scalability for Large File Transfers
It is essential to optimize both the upload and download processes to ensure efficiency, reliability, and performance. Two critical settings that impact this are the Part Size for Multipart Upload and the Download Chunk Size. These parameters allow fine-tuning of how files are split into manageable parts for transfer, improving memory usage, reducing the chance of network-related errors, and allowing for parallel operations.
Below is a detailed explanation of each setting:
1. Part Size for Multipart Upload (bytes)
This parameter controls the size of each part during a multipart upload to Amazon S3. In a multipart upload, a large file is divided into smaller parts, each of which is uploaded independently. The minimum size for each part is 5 MB (5,242,880 bytes), as per AWS S3 requirements.
Purpose: Setting this value determines the size of each part when uploading large files using multipart upload.
Recommendations: For optimal performance, choose a part size that balances between the number of parts and the upload speed. A smaller part size results in more parts, which can impact the efficiency of the upload, especially with very large files.
The default minimum part size for S3 multipart uploads is 5 MB, but depending on your network conditions or file sizes, you can increase this to reduce the number of parts, which may enhance performance.
Example: If you are uploading a 1 GB file with a part size of 10 MB (10,485,760 bytes), the file will be split into 103 parts: 102 full 10 MB parts plus one final smaller part (1 GB = 1,073,741,824 bytes).
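To make the mechanics concrete, here is a simplified multipart upload sketch in AWS SDK for Java v2. It buffers the whole payload in memory purely for brevity; the bucket, key, and 10 MB part size are illustrative placeholders, not connector defaults.

```java
import java.util.ArrayList;
import java.util.List;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;
import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
import software.amazon.awssdk.services.s3.model.UploadPartRequest;
import software.amazon.awssdk.services.s3.model.UploadPartResponse;

public class MultipartUploadSketch {

    static final long PART_SIZE = 10L * 1024 * 1024; // 10 MB, must be >= 5,242,880 bytes

    // Uploads `data` to s3://bucket/key in PART_SIZE pieces.
    static void upload(S3Client s3, String bucket, String key, byte[] data) {
        // 1. Start the multipart upload and remember its upload ID.
        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build())
                .uploadId();

        // 2. Upload each part independently; a failed part can be retried on its own.
        List<CompletedPart> parts = new ArrayList<>();
        int partNumber = 1;
        for (long offset = 0; offset < data.length; offset += PART_SIZE, partNumber++) {
            int length = (int) Math.min(PART_SIZE, data.length - offset);
            byte[] part = new byte[length];
            System.arraycopy(data, (int) offset, part, 0, length);

            UploadPartResponse response = s3.uploadPart(
                    UploadPartRequest.builder()
                            .bucket(bucket).key(key)
                            .uploadId(uploadId)
                            .partNumber(partNumber)
                            .build(),
                    RequestBody.fromBytes(part));

            parts.add(CompletedPart.builder()
                    .partNumber(partNumber)
                    .eTag(response.eTag())
                    .build());
        }

        // 3. Tell S3 to assemble the object from the uploaded parts.
        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key)
                .uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                .build());
    }
}
```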
2. Download Chunk Size (bytes)
This parameter specifies the size of each chunk to be used during a chunked download from Amazon S3. When downloading a large file, it can be broken into smaller pieces (chunks), each of which is downloaded separately.
Purpose: Defines the size of each chunk during the download process. This is useful for handling large files efficiently and for providing a consistent experience even with unstable network conditions.
Recommendations: Choose a chunk size based on your available memory and network stability. Larger chunks may improve download speed but require more memory, while smaller chunks provide better fault tolerance and reduce memory usage.
The appropriate chunk size can vary based on file sizes and the network’s reliability. Larger chunks may be beneficial in high-bandwidth environments, while smaller chunks help in environments with frequent network disruptions.
Example: If you set the chunk size to 1 MB (1,048,576 bytes), a 500 MB file will be downloaded in 500 separate chunks.
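For reference, a chunked download corresponds to a series of ranged GET requests. A minimal sketch in AWS SDK for Java v2, assuming a 1 MB chunk size and accumulating the result in memory for brevity (the bucket and key are placeholders):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;

public class ChunkedDownloadSketch {

    static final long CHUNK_SIZE = 1024 * 1024; // 1 MB chunks

    // Downloads s3://bucket/key one ranged GET at a time.
    static byte[] download(S3Client s3, String bucket, String key) throws IOException {
        long size = s3.headObject(
                HeadObjectRequest.builder().bucket(bucket).key(key).build()).contentLength();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long start = 0; start < size; start += CHUNK_SIZE) {
            long end = Math.min(start + CHUNK_SIZE, size) - 1; // Range header is inclusive

            // Each chunk is an independent request, so a failed chunk can be retried
            // without re-downloading what has already been received.
            ResponseBytes<GetObjectResponse> chunk = s3.getObjectAsBytes(
                    GetObjectRequest.builder()
                            .bucket(bucket).key(key)
                            .range("bytes=" + start + "-" + end)
                            .build());
            out.write(chunk.asByteArray());
        }
        return out.toByteArray();
    }
}
```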
These settings allow for more control over how files are uploaded to and downloaded from Amazon S3, ensuring that large files can be handled efficiently and reliably.
Decryption
When the S3 Connection is used as a source (FROM) in a source-to-destination transformation, it is possible to configure automatic decryption of encrypted source files using the PGP algorithm and a private key uploaded to the secure key storage.
If the private key is available, all source files processed by the transformation will be automatically decrypted using the PGP algorithm and the given key. Note that the private key requires a password.