When to use this Flow type
Etlworks can work with files directly. The following file-based operations are supported:
- Copy files
- Move files
- Rename files
- Delete files
- Create folder(s)
- Zip files
- Unzip files
- Check number of files in folder
- All file management operations
When working directly with files, Etlworks does not modify the file contents. If you need to change the files, use the extract data from source, transform, and load it to the destination Flow type.
Create Flow
Here are the steps to create a Flow:
Step 1. Start creating a Flow by clicking `Add flow` in the `Flows` window.
Step 2. In the opened box, select `Files`.
Step 3. Select one of the file operations above.
Step 4. Continue by adding transformations and configuring Flow parameters.
Copy files
Use this Flow type to copy files between Connections. Here's how you can do this:
Step 1. Start creating the Flow in the `Flows` window by clicking `+` and typing in `copy files`.
Step 2. Select the source Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- WebDAV
- SMB Share
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- HTTP (web service)
- Redis
- Google Sheets
- Google Analytics
- Inbound Email
- MongoDB
Step 3. In the `FROM` field, enter the file name, or a wildcard file name, for the file(s) to copy.
Step 4. Select the destination Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- HTTP (web service)
- Redis
- Google Sheets
- Outbound Email
- MongoDB
Step 5. Optionally, enter a new file name or a new wildcard file name into the `TO` field. Read how the system calculates a destination file name in file operations.
Step 6. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Add Suffix to the Destination File Name`: select one of the predefined suffixes for the files created by this file operation. For example, if you select `uuid` as a suffix and the original file name is `dest.csv`, Etlworks will create files with the name `dest_uuid.csv`, where `uuid` is a globally unique identifier such as `21EC2020-3AEA-4069-A2DD-08002B30309D`.
- `Do not copy files which have already been copied`: if this option is enabled, Etlworks will not copy files that have already been copied.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Wait before moving to next file`: the number of milliseconds to wait before starting to copy the next file. This parameter is used to prevent throttling, specifically when the destination is an HTTP endpoint. Read about battling the throttling.
- `Maximum Simultaneous Operations`: Etlworks can copy each file in its own thread. Use this property to set the maximum number of simultaneous file operations. Read about parallel processing.
- `Maximum Number of Files to Process`: if the value of this property is greater than 0, the Flow will stop copying files after the number of processed files reaches the configured threshold.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues. Read how to configure the Flow to ignore all or specific exceptions.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when an error occurs. It can be useful if, for example, you want to copy files that haven't been processed yet, due to an error, to a failed folder.
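The suffix rule above can be sketched in plain JavaScript. `addSuffix` is an illustrative helper, not an Etlworks API: the suffix is inserted between the base name and the extension.

```javascript
// Illustrative sketch: insert a suffix before the file extension,
// mirroring how Etlworks turns dest.csv into dest_<uuid>.csv.
// addSuffix is a hypothetical helper, not part of the Etlworks API.
function addSuffix(fileName, suffix) {
  const dot = fileName.lastIndexOf('.');
  if (dot < 0) {
    return fileName + '_' + suffix; // no extension: append at the end
  }
  return fileName.slice(0, dot) + '_' + suffix + fileName.slice(dot);
}

// Example with a uuid suffix:
console.log(addSuffix('dest.csv', '21EC2020-3AEA-4069-A2DD-08002B30309D'));
// dest_21EC2020-3AEA-4069-A2DD-08002B30309D.csv
```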
Only copy new files
If you want to copy only the files that haven't been copied yet, enable the `Do not copy files which have already been copied` option under `MAPPING` > `Parameters`.
Move files
Using this Flow type, you can move files between Connections. When a file is moved from the source to a destination, it is first copied to the destination, then deleted from the source.
Step 1. Start creating a Flow in the `Flows` window by clicking `+` and typing in `move files`.
Step 2. Select the source Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- HTTP (web service)
- Redis
- Google Sheets
- Google Analytics
- Inbound Email
- MongoDB
Step 3. In the `FROM` field, enter the file name, or a wildcard file name, for the file(s) to move.
Step 4. Select the destination Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- HTTP (web service)
- Google Sheets
- Outbound Email
- Redis
- MongoDB
Step 5. Optionally, enter a new file name or a new wildcard file name into the `TO` field. Read how the system calculates a destination file name in file operations.
Step 6. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Add Suffix to the Destination File Name`: select one of the predefined suffixes for the files created by this file operation. For example, if you select `uuid` as a suffix and the original file name is `dest.csv`, Etlworks will create files with the name `dest_uuid.csv`, where `uuid` is a globally unique identifier such as `21EC2020-3AEA-4069-A2DD-08002B30309D`.
- `Do not move files which have already been moved`: if this option is enabled, Etlworks will not move files that have already been moved.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Wait before moving to next file`: the number of milliseconds to wait before starting to move the next file. This parameter is used to prevent throttling, specifically when the destination is an HTTP endpoint. Read about battling the throttling.
- `Maximum Simultaneous Operations`: Etlworks can move each file in its own thread. Use this property to set the maximum number of simultaneous file operations. Read about parallel processing.
- `Maximum Number of Files to Process`: if the value of this property is greater than 0, the Flow will stop moving files after the number of processed files reaches the configured threshold.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when an error occurs. It can be useful if, for example, you want to move files that haven't been processed yet, due to an error, to a failed folder.
Move only new files
If you want to move only the files that haven't been moved yet, enable the `Do not move files which have already been moved` option under `MAPPING` > `Parameters`.
Rename files
Using this Flow type, you can rename files.
Step 1. Start creating a Flow in the `Flows` window by clicking `+` and typing in `rename files`.
Step 2. Select the source Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- Redis
Step 3. In the `FROM` field, enter a file name or a wildcard file name of the file(s) to rename.
Step 4. Select a destination Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- Redis
Step 5. Enter a new file name into the `TO` field. Read how the system calculates a destination file name in file operations.
Step 6. Click the `MAPPING` button, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Add Suffix to the Destination File Name`: select one of the predefined suffixes for the files created by this file operation. For example, if you select `uuid` as a suffix and the original file name is `dest.csv`, Etlworks will create files with the name `dest_uuid.csv`, where `uuid` is a globally unique identifier such as `21EC2020-3AEA-4069-A2DD-08002B30309D`.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Maximum Number of Files to Process`: if the value of this property is greater than 0, the Flow will stop renaming files after the number of processed files reaches the configured threshold.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when an error occurs. It can be useful if, for example, you want to copy files that haven't been processed yet, due to an error, to a failed folder.
Delete files
Using this Flow type, you can delete files.
Step 1. Start creating a Flow in the `Flows` window by clicking `+` and typing in `delete files`.
Step 2. Select the source Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- HTTP (web service): Etlworks will attempt to execute HTTP requests using the `HTTP DELETE` method.
- Redis
Step 3. In the `FROM` field, enter a file name, or a wildcard file name, of the file(s) to delete.
Step 4. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Wait before moving to next file`: the number of milliseconds to wait before starting to delete the next file. This parameter is used to prevent throttling, specifically when the destination is an HTTP endpoint. Read about battling the throttling.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Maximum Number of Files to Process`: if the value of this property is greater than 0, the Flow will stop deleting files after the number of processed files reaches the configured threshold.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when an error occurs. It can be useful if, for example, you want to delete the files that haven't been processed due to the error.
Create folder(s)
Using this Flow type, you can create a folder in any supported file storage.
Step 1. Start creating a Flow in the `Flows` window by clicking `+` and typing in `create folder`.
Step 2. Select the source Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
Step 3. In the `FROM` field, enter the name of the folder to create. Use `folder1/folder2/foldern` to create multiple nested folders.
The folder will be created under the base `Directory` specified in the Connection.
Step 4. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when any error occurs.
Zip files
Using this Flow type, you can create an archived file in the `zip` or `gzip` format.
Step 1. Start creating a Flow in the `Flows` window by clicking the `+` button and typing in `zip files`.
Step 2. Select the source Connection. It should have the following Connection type:
Step 3. In the `FROM` field, enter the file name, or a wildcard file name, of the file(s) to zip.
Step 4. Select the destination Connection. It should be the following Connection type:
Step 5. Enter the name of the archive file into the `TO` field.
By default, Etlworks creates archives in the `zip` format. Enter a file name with the `gzip` extension to create an archive in the `gzip` format.
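The naming rule above can be sketched as follows. `archiveFormat` is an illustrative helper, not an Etlworks API: the archive format is chosen from the destination file extension, falling back to the default `zip`.

```javascript
// Illustrative sketch of the rule above: a .gzip extension selects the
// gzip format, anything else uses the default zip format.
// archiveFormat is a hypothetical helper, not part of the Etlworks API.
function archiveFormat(fileName) {
  return fileName.toLowerCase().endsWith('.gzip') ? 'gzip' : 'zip';
}

console.log(archiveFormat('dest.gzip')); // gzip
console.log(archiveFormat('dest.zip'));  // zip
```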
Step 6. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Action`: choose either `Zip` or `Zip and Delete`. The latter will delete the original source files after creating the archive.
- `Add Suffix to the Zip File Name`: select one of the predefined suffixes for the files created by this file operation. For example, if you select `uuid` as a suffix and the original file name is `dest.zip`, Etlworks will create a file with the name `dest_uuid.zip`, where `uuid` is a globally unique identifier such as `21EC2020-3AEA-4069-A2DD-08002B30309D`.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Maximum Simultaneous Operations`: Etlworks can process each file in its own thread. Use this property to set the maximum number of simultaneous file operations. Read about parallel processing.
- `Maximum Number of Files to Process`: if the value of this property is greater than 0, the Flow will stop archiving files after the number of processed files reaches the configured threshold.
- `Zip password`: optional password for the zip file. Only `zip` files can be password-protected.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when any error occurs.
Unzip file
Using this Flow type, you can unzip an archived file in the `zip` or `gzip` format.
Step 1. Start creating a Flow in the `Flows` window by clicking `+` and typing in `unzip files`.
Step 2. Select the source Connection. It should have the following Connection type:
Step 3. In the `FROM` field, enter the archived file name. Wildcard file names are supported.
Step 4. Select the destination Connection. It should have the following Connection type:
Step 5. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Action`: choose one of the following: `Unzip`, `Unzip and Delete`, `UnGZip`, or `UnGZip and Delete`. The actions with `Delete` will delete the original archived file.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Maximum Simultaneous Operations`: Etlworks can process each file in its own thread. Use this property to set the maximum number of simultaneous file operations. Read about parallel processing.
- `Zip password`: optional password for the zipped file.
- `Do not create subfolders when unzipping files with nested folders`: if this option is enabled and the zip file has nested subfolders, the Flow will unzip files directly into the destination folder without creating subfolders. If this option is disabled (the default), the Flow will create a subfolder for each folder in the zip file.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when any error occurs.
Check number of files in folder
Using this Flow type, you can compare the number of files in a folder with a given constant or evaluate a boolean expression. When using a constant, if the number of files is not equal to the expected number, Etlworks will generate an exception. The most common use case is to generate an exception if the actual number of matching files is not `0`.
When using an expression, Etlworks evaluates the expression and, if the returned value is boolean `false`, throws an exception. Example of a boolean expression: `filesCount > 5`.
Step 1. Start creating a Flow in the `Flows` window by clicking `+` and typing in `check number of files`.
Step 2. Select the source Connection. It can be any one of the following Connection types:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
Step 3. In the `FROM` field, enter the file name or the wildcard file name to look for.
Step 4. Click `MAPPING`, select the `Parameters` tab, and modify the following parameters, if necessary:
- `Expected Number of Files or Expression`: enter the expected number of files, for example `0`, or a boolean expression, for example `filesCount > 5`. The expression can be any JavaScript or Python code that returns boolean `false` or `true`. The following objects are available by reference:
  - `filesCount`: the actual number of files whose names match the entered file name or wildcard file name.
  - `files`: a `java.util.ArrayList` containing the files whose names match the entered file name or wildcard file name.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected, Etlworks will execute this file operation when any error occurs.
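The behavior described above can be sketched in plain JavaScript. `checkNumberOfFiles` is an illustrative helper, not Etlworks internals: a numeric value is compared to the actual count, while anything else is evaluated as a boolean expression with `filesCount` and `files` in scope.

```javascript
// Illustrative sketch of the "Check number of files" behavior described
// above. This is not Etlworks internals; names are ours.
function checkNumberOfFiles(files, expectedOrExpression) {
  const filesCount = files.length;
  const asNumber = Number(expectedOrExpression);
  if (!Number.isNaN(asNumber)) {
    // A constant: generate an exception unless the count matches exactly
    if (filesCount !== asNumber) {
      throw new Error('Expected ' + asNumber + ' files, found ' + filesCount);
    }
    return;
  }
  // An expression: filesCount and files are available by reference
  const result = new Function('filesCount', 'files',
      'return (' + expectedOrExpression + ');')(filesCount, files);
  if (result === false) {
    throw new Error('Expression "' + expectedOrExpression + '" returned false');
  }
}

// Passes silently: 6 matching files, expression returns true
checkNumberOfFiles(['a', 'b', 'c', 'd', 'e', 'f'], 'filesCount > 5');
```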
All file management operations
The File Management Flow allows you to create multiple transformations, such as copy, move, rename, etc., within a single Flow. You can also choose to use the specific Flow types: copy, move, etc.
Step 1. Create the source (`FROM`) and destination (`TO`) Connections, which can be one of the following:
- Amazon S3
- Google Cloud Storage
- Microsoft Azure Storage
- Server Storage
- FTP
- FTPS
- SFTP
- Box
- Dropbox
- Google Drive
- OneDrive for Business
- SharePoint
- WebDAV
- SMB Share
- HTTP (web service)
- Redis
- Google Sheets
- Google Analytics
- Inbound Email
- MongoDB
Step 2. Select the `File Management` Flow type from the list.
Step 3. Create a transformation where the source (`FROM`) is a location with the source files and the destination (`TO`) is a location where files should be created (they can be the same, which is typical for the rename operation).
Step 4. Select or enter the source and destination file names in the `Mappings` box. Both can be wildcard names, such as `*.csv`.
Step 5. Continue by specifying the transformation parameters:
- `Action`: the file operation: `Copy`, `Move`, `Rename`, or `Delete` the files, `Create Folder(s)`, or `Check number of files` in the folder.
- `Do not process files which have already been processed`: if this option is enabled, Etlworks will not process files that have already been processed.
- `Capture metrics`: if this option is selected, the information about each processed file will be captured and displayed in the Flow dashboard. Disable it if you expect to process a large number of files (> 1K).
- `Wait before moving to the next file`: the number of milliseconds to wait before starting to process the next file. This parameter is used to prevent throttling, especially when the destination is an HTTP endpoint. Read about battling the throttling.
- `Maximum Simultaneous Operations`: Etlworks can process each file in its own thread. Use this property to set the maximum number of simultaneous file operations. If not set, the default is `10`. Read about parallel processing.
- `Maximum Number of Files to Process`: if the value of this property is greater than 0, the Flow will stop processing files after the number of processed files reaches the configured threshold.
- `On Exception`: by default, any error halts execution. When `ignore` is selected, errors are ignored and execution continues.
- `Exception Mask`: specify which errors should be ignored while still halting execution for all other errors. Enter part or all of the exception string. This field works only when the `Ignore On Exception` option is selected.
- `Execute if Error`: if this option is selected and an error occurs, Etlworks will execute the chosen file operation. It can be useful if, for example, you want to move files that haven't been processed due to an error to the failed folder.
When selecting `FROM` and `TO`:
- For `Copy` actions: choose the source (`FROM`) and destination (`TO`) Connections. Enter the file name, or a wildcard file name such as `*.csv`, into the source (`FROM`) field. The files from the source location (`FROM` Connection) will be copied to the destination location (`TO` Connection).
- For `Move` actions: choose the source (`FROM`) and destination (`TO`) Connections (they can be the same). Enter the file name, or a wildcard file name such as `*.csv`, into the source (`FROM`) field. Files from the source location (`FROM` Connection) will be moved to the destination location (`TO` Connection).
- For `Rename` actions: choose the source (`FROM`) and destination (`TO`) Connections (they can be the same). Enter the file name, or a wildcard file name such as `*.csv`, into the source (`FROM`) field. Enter a new file name, or a wildcard file name such as `dest.*`, into the destination (`TO`) field. Files from the source location (`FROM` Connection) will be moved to the destination location (`TO` Connection) and renamed in the process.
- For `Delete` actions: choose the source (`FROM`) Connection. Enter the file name, or a wildcard file name such as `*.csv`, into the source (`FROM`) field. Files in the source location (`FROM` Connection) which match the string entered in the `FROM` field will be deleted.
- For `Create Folder(s)` actions: choose the source (`FROM`) Connection. Enter the name of the folder to be created into the `FROM` field. If that folder doesn't exist, it will be created under the `URL`/`Directory` of the `FROM` Connection.
- For `Check Number of Files` actions: choose the source (`FROM`) Connection. Enter the file name, or a wildcard file name such as `*.csv`, into the source (`FROM`) field. The system will calculate the number of files whose names match the `FROM` field, compare it to the entered number of files, and generate an exception if they are not equal.
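Wildcard file names such as `*.csv` and `dest.*` can be understood as simple patterns where `*` matches any sequence of characters and `?` matches a single character. Below is a minimal sketch of such a matcher; it is illustrative only, and Etlworks' own matcher may differ in details.

```javascript
// Illustrative sketch: translate a wildcard pattern into a RegExp.
// Regex metacharacters are escaped first, then * and ? are translated.
function wildcardToRegExp(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*').replace(/\?/g, '.') + '$');
}

const matcher = wildcardToRegExp('*.csv');
console.log(matcher.test('orders.csv'));  // true
console.log(matcher.test('orders.json')); // false
```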
Filter files using a script
It is quite typical to need to filter (or group) files by filename, size, or other attributes, create buckets, and process each bucket independently.
In Etlworks, you can use a script Flow to create the buckets. Then, instead of providing a file name or a wildcard filename in the file operations, use a bucket name.
Example: create a flow that moves files smaller than 10KB to the 'small files' folder and those larger or equal to 10KB to the 'large files' folder.
Also read about file loop with filter.
Step 1. Create a new script flow.
Step 2. Add a named source connection to the flow. Use the connection that contains the source files. The name that you assign to the connection is going to be used in the next step.
Step 3. Add the JavaScript or Python code to filter or split the files into buckets. Below is an example in JavaScript which splits the files into large-files and small-files buckets.
// Get the list of the files in the folder matching the wildcard.
// For the connection, use the same name as you assigned
// to the named connection in step 2.
var list = com.toolsverse.etl.core.task.common.FileManagerTask.list(etlConfig, 'source', '*.*');
// create buckets
var largeFiles = new java.util.ArrayList();
var smallFiles = new java.util.ArrayList();
// split files in buckets
if (list != null) {
for each (var file in list) {
if (file.getSize()>=10000) {
largeFiles.add(file);
} else {
smallFiles.add(file);
}
}
}
// add named buckets to the common object storage
etlConfig.setValue('large_files', largeFiles);
etlConfig.setValue('small_files', smallFiles);
Available file attributes:
- `file.getName()`: the name of the file with an extension but without the folder.
- `file.getSize()`: the size of the file in bytes.
- `file.getPath()`: the full path.
- `file.getLastModified()`: the Unix-epoch timestamp when the file was last modified.
Read about commonly used packages and classes (with examples).
Step 4. Create a new flow to process files. It can be any of the following:
Step 5. Add a source-to-destination transformation for each bucket.
When creating a transformation, enter `list(bucket_name)` in `FROM`.
Set `DESTINATION CONNECTION` to the connection which points to the specific destination folder. Or you can reuse the same connection by adding the destination folder name in `TO`, for example `/large/*`.
Step 6. Combine flows created in steps 1 and 4 into the nested flow.
Enable parallel file operations
When processing files by a wildcard name, for example `*.csv`, it is possible to configure parallel processing by setting the `Maximum Simultaneous Operations` parameter to a value greater than 1.
The following file operations support parallel processing:
If you set the `Wait before moving to next file` parameter to a value greater than zero, parallel processing will be automatically disabled. This parameter controls how long the system should wait before processing the next file and is designed to battle the throttling enforced by the owners of third-party services.
Battle the throttling
Some owners of third-party services enforce throttling, designed to prevent a large number of requests in a short time frame.
In Etlworks, you can set the `Wait before moving to next file` parameter to a value greater than zero. This causes the system to wait the configured amount of time before processing the next file in the queue.
Additionally, you can set the `Maximum Number of Files to Process` property to a value greater than zero. The Flow will then stop processing files after the number of processed files reaches the configured threshold. This property is useful if you have a large number of files in the queue (thousands) and would prefer to process them in chunks.
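Together, these two parameters can be sketched as a simple processing loop. The function and parameter names below are illustrative, not Etlworks APIs:

```javascript
// Illustrative sketch of the throttling controls described above:
// wait a configured number of milliseconds between files and stop
// after a configured maximum number of files.
async function processInChunks(files, handler, waitMs, maxFiles) {
  let processed = 0;
  for (const file of files) {
    if (maxFiles > 0 && processed >= maxFiles) {
      break; // Maximum Number of Files to Process reached
    }
    await handler(file);
    processed++;
    if (waitMs > 0) {
      // Wait before moving to next file
      await new Promise(resolve => setTimeout(resolve, waitMs));
    }
  }
  return processed;
}
```

Note that a per-file wait implies sequential processing, which is why setting it to a value greater than zero disables parallel file operations.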
Process files in the specific order
When processing files by a wildcard filename, for example `*.csv`, the file management Flows (`Copy`, `Move`, `Rename`, `Delete`, `Zip`) first capture the list of files to process, then sort the list. The list is sorted using the algorithm selected for the source Connection.
Available options:
- `Disabled`: default sorting for the Connection, most likely by filename in ascending order.
- `oldest`: oldest files first.
- `newest`: newest files first.
- `ascending`: by filename in ascending order.
- `descending`: by filename in descending order.
- `largest`: largest files first.
- `smallest`: smallest files first.
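The sorting options can be sketched as comparators over the file attributes (name, size, last-modified). The option names match the list above; the comparator code itself is illustrative, not Etlworks internals:

```javascript
// Illustrative comparators for the sorting options listed above,
// applied to file objects with name, size and lastModified attributes.
const comparators = {
  oldest:     (a, b) => a.lastModified - b.lastModified,
  newest:     (a, b) => b.lastModified - a.lastModified,
  ascending:  (a, b) => a.name.localeCompare(b.name),
  descending: (a, b) => b.name.localeCompare(a.name),
  largest:    (a, b) => b.size - a.size,
  smallest:   (a, b) => a.size - b.size,
};

function sortFiles(files, option) {
  const cmp = comparators[option];
  // 'Disabled' (or unknown option): keep the Connection's default order
  return cmp ? [...files].sort(cmp) : [...files];
}
```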