Get started with files – Etlworks Support

Etlworks reads, writes, transforms, and moves files in any of the supported data exchange formats — CSV, JSON, XML, Excel, Avro, Parquet, EDI, HL7, and more. Files can live on cloud storage, server storage, FTP/FTPS/SFTP, WebDAV, SMB, Box, Dropbox, Google Drive, OneDrive, SharePoint, NoSQL stores like Redis or MongoDB, or in an email server.

What can you do with files in Etlworks?

Use case	How
ETL with files	Files as a source or destination in a source-to-destination transformation.
File management	Copy, move, rename, delete, zip, unzip, create folders — see File operations.
Split or merge files	Split a large file into smaller chunks, or merge multiple files into one.
Transform XML with XSLT	Apply an XSL stylesheet to one or more XML files.
Process EDI / HL7 messages	HL7: all common formats and protocols. EDI: X12, EDIFACT, and others.
Expose a dataset as a REST endpoint	Building custom APIs — the PULL pattern returns a file's content (or a dataset) as an HTTP response.

Connections for files, NoSQL, and email

Pick the connector that matches where your files live. All of these support file-based and ETL flows.

Cloud storage: Amazon S3, Google Cloud Storage, Microsoft Azure Storage
Server / local storage: Server Storage
FTP family: FTP, FTPS, SFTP
SaaS storage: Box, Dropbox, Google Drive, OneDrive for Business, SharePoint
Network protocols: WebDAV, SMB Share
Key-value / NoSQL: Redis, MongoDB
Email: Inbound email, Outbound email

Test a connection

Click Test Connection on the connection screen. Etlworks connects to the storage and attempts to list files using the configured folder and filename. Wildcard filenames are supported.

Test connection

Supported file formats

Browse files and view data in Etlworks Explorer

Create a file connection.
Create a format matching the file.
Open Etlworks Explorer, select the connection, and link the format.
Browse files and fields as metadata, view rows in a grid, run SQL-like queries, inspect dependencies.

ETL with files

File operations

Upload local files

To upload a file from your machine into Etlworks, create a connection and use Etlworks Explorer to upload. Supported connection types for local-file upload:

Amazon S3, Google Cloud Storage, Microsoft Azure Storage, Server Storage, FTP, FTPS, SFTP, WebDAV, HTTP, Box, Dropbox, Google Drive, OneDrive for Business, SharePoint.

Create the connection.
Open Etlworks Explorer and upload the file.
Create a flow that uses the connection as a source.

Use server storage as a staging area

Every Etlworks account gets its own Home folder on Etlworks servers. Most of your files are probably in cloud or remote storage (S3, FTP, SFTP, …), but staging copies locally first can be faster for some workloads. After processing, you can delete the staged copies.

Create a server storage connection.
Copy or move files from cloud or remote storage to the server storage.
Process files from server storage.
Delete the staged files.

Process files in an ETL flow using wildcard filenames

Etlworks can process multiple files in a folder using a wildcard pattern. This section covers the wildcard option in ETL transformations where the source is one or more files.

For related patterns, see:

Process files using file operations (copy / move / etc.)
File loop
Per-warehouse bulk-load flows (Snowflake, Redshift, BigQuery, …) — see Working with cloud data warehouses.

Setup

Create a source connection.
Create a destination connection.
Create a flow that reads a file and loads it into the destination.
In the source-to-destination transformation, enter a wildcard filename (e.g. *.csv) in the FROM field.
Optional: enter a comma-separated list of files to exclude or to include.

Handling processed files

By default, processed files remain in the source folder. Three post-processing options are available, applied in priority order if more than one is enabled:

Delete loaded source files (highest priority) — removes the file after a successful load. Optionally enable Delete source files on error to remove files if the load fails too. Under MAPPING → Parameters.
Skip previously processed files — keeps files in place but tracks which have been processed, so future runs skip them. Configure under MAPPING → Parameters:
- Skip Previously Processed Files — only new or modified files are processed.
- File Retention in Cache (ms) — how long processed-file metadata is cached.
- Custom Cache File Name — override the default cache filename.
Move processed files to (lowest priority) — moves the file to another folder after a successful load. Set under Flow → Connections tab. Ignored if either of the two options above is enabled.

Configure destination name (TO)

Default behavior. The TO field can include *. The flow replaces * with the source filename (without extension).

Example. FROM = *.json, TO = public.*. With files test1.json, test2.json, test3.json, the flow creates / updates tables public.test1, public.test2, public.test3.

Programmatic destination naming. Use a regular expression or JavaScript in Calculate Destination Object Name under MAPPING → Parameters. The flow auto-detects whether the input is JavaScript or a regex.

Calculate Destination Object Name

Regex example. Extract the part of the filename between - and _cdc_stream:

Source (FROM): *-*_cdc_stream_*.csv
Destination (TO): *
Calculate Destination Object Name: -(.*?)_cdc_stream

JavaScript example. The last evaluated expression replaces * in TO:

var start = name.indexOf('-');
var end = name.indexOf('_cdc_stream');
value = name.substring(start + 1, end);

Process files in a specific order

When wildcards match multiple files, the source connection's File processing order setting decides which file comes first.

File processing order

Option	What it does
Disabled	Connection default — typically filename ascending.
oldest / newest	Sorted by modification time, oldest or newest first.
ascending / descending	Sorted by filename, A→Z or Z→A.
largest / smallest	Sorted by file size, largest or smallest first.

Folders and relative paths

Every file-based and cloud-storage connector lets you configure a Directory. The Directory is the root folder for all ETL and file-based operations using that connection. Relative paths in flow FROM / TO fields are resolved relative to it.

Connection Directory

The Directory can include multiple subfolders, e.g. demo/data or archive/2025/reports. If empty, the storage system's default root is used.

Example. Directory = demo/data. FROM = /sales/*.csv resolves to /demo/data/sales/*.csv.

Connectors that support relative paths and folder auto-creation

Connector	ETL FROM/TO	File-based FROM/TO	Auto-create folders
Amazon S3	Since 7.x	Since 7.x	N/A
Google Cloud Storage	Since 8.1.5	Since 8.1.5	N/A
Microsoft Azure Storage	Since 7.x	Since 7.x	N/A
Server Storage	Since 8.1.5	Since 7.x	Since 8.1.5
FTP	Since 8.1.5	Since 8.1.5	Since 8.1.5
FTPS	Since 8.1.5	Since 8.1.5	Since 8.1.5
SFTP	Since 7.x	Since 7.x	Since 8.1.5
Box	Since 8.1.5	Since 8.1.5	Since 8.1.5
Dropbox	Since 8.1.5	Since 8.1.5	Since 8.1.5
Google Drive	Since 8.1.5	Since 8.1.5	Since 8.1.5
OneDrive for Business	Since 7.x	Since 7.x	Since 8.1.5
SharePoint	Since 7.x	Since 7.x	Since 8.1.5
WebDAV	Since 8.1.5	Since 8.1.5	Since 8.1.5
SMB Share	Since 8.1.5	Since 8.1.5	Since 8.1.5

Relative-path examples

All examples use a connection whose Directory is /opt/data.

ETL flow:

FROM = /03012020/*.csv, TO = /archive/03012020/ — reads from /opt/data/03012020 and writes to /opt/data/archive/03012020.
FROM = /01012020/data.csv, TO = /processed/data.csv — reads /opt/data/01012020/data.csv, writes /opt/data/processed/data.csv.

File-based flow:

Copy Files: FROM = /02012020/*.csv, TO = /backup/02012020/* — copies /opt/data/02012020/*.csv to /opt/data/backup/02012020.
Move Files: FROM = /03012020/*.csv, TO = /archive/03012020/*.
Delete Files: FROM = /01012020/*.csv.
Create Folder: FROM = /new/data/2025/ — creates the full path under the Directory.

Create Folder flow

The Create Folder flow creates multiple nested folders in one operation. Running it with path /test/data creates every missing folder under the base directory.

Create Folder flow

Automatic folder creation

All file-based and cloud-storage connectors include an Automatically Create Missing Folders option. When on, the connector creates any missing folders during ETL or file-based operations (Copy Files and Move Files). Off by default. Enable it when your flow needs to create new subfolders on the fly.

Automatically create missing folders

Append rows to an existing file

By default, Etlworks overwrites the destination file every time a flow writes to it. To append new rows to an existing file, use a temporary database as a staging area.

How it works

Load the existing file into a temporary table.
Load the new rows into the same temporary table.
Write the combined contents back to the destination file.

Step 1. Create a temporary database connection

Create a temporary (staging) database connection. Recommended settings:

Always enclose table/column names in double quotes — tolerates non-standard column names.
Varchar Field Size Multiplier = 1000 — prevents errors on very long string columns.
All columns are varchars — keeps data as-is with no type conversion.

Temp database connection settings

Step 2. Create the flow

Open Flows, click Add flow, and pick Any to any ETL.

Any to any ETL flow

Step 3. Load the existing file into the staging table

Add the first source-to-destination transformation:

Source connection — connection for the existing destination file.
Source format — format matching the existing file.
FROM — the existing file (the one you want to append to).
Destination connection — the temporary database from Step 1.
TO — a staging table name (e.g. staging). The table is created when the flow runs and dropped when it finishes.

First transformation

Go to Configure → Source is a File and enable Ignore when there is no file so the flow still runs the first time, when the destination file doesn't exist yet.

Ignore when there is no file

Step 4. Load the new rows into the same staging table

Add a second source-to-destination transformation:

Source connection — connection for the new file(s).
Source format — same format as the destination file.
FROM — the new file or wildcard (e.g. *.csv).
Destination connection — the temporary database.
TO — the same staging table name as in Step 3.

Second transformation

Step 5. Write the combined data back to the file

Add a final source-to-destination transformation:

Source connection — the temporary database.
FROM — the staging table name.
Destination connection — the original destination file connection.
Destination format — the original file format.
TO — the original destination filename.

Final transformation

Removing duplicate rows (optional)

If appending might produce duplicates, filter them with a Source Query on the final transformation. Use SQL like SELECT DISTINCT * FROM staging or a more specific window-function query depending on your data.

Source query for dedup

Tip: keep all fields as strings during staging

Enable All fields are strings in both source formats (Steps 3 and 4) to prevent automatic type conversion during extract. Values are preserved exactly as they appear in the file.

All fields are strings

Disable automatic field-data-type detection

Etlworks detects column data types from the first few rows of a CSV. This usually works, but in edge cases it picks the wrong type:

id, first_name, last_name
1, Joe, Doe
2, Jane, Doe
AB, Sima, The Dog

From the first two rows, the system infers id as NUMERIC. The third row's AB then fails the database load.

Two fixes:

Manually set the column data type in mapping — precise but per-column.
Enable All fields are strings on the source format — all columns become VARCHAR, length set to the database's maximum. Useful for staging tables.

All fields are strings

Working with very large files

For very large files, three techniques help:

Split files into smaller chunks. Splitting uses minimal RAM and CPU.
Performance tips when working with files and web services.
Merge files on the way out if you need a single output.

Read gzipped and zipped files in ETL transformations

Etlworks reads .gz and .zip files directly when the file is a source in an ETL transformation — no separate extract step. This is called inline unzip.

Requirements:

The file has a .gz or .zip extension.
The archive contains a single file (multi-file ZIPs aren't supported in this path).

Set FROM to the .gz or .zip path. Wildcard filenames are supported.

Inline unzip in transformation

Inline unzip also works in Etlworks Explorer.

Inline unzip in Explorer

Capture or modify the filename in a source-to-destination transformation

Almost every format in Etlworks (CSV, JSON, …) can be configured with a JavaScript function to capture or modify the source / destination filename. The function runs only when the format is used in a source-to-destination ETL transformation.

Filename function on a format

Variables available in the function

fileName — filename calculated by the connector.
isReading — true when running for the source, false for the destination.
batchNumber — integer batch number when the connector splits the file, otherwise -1.

Modify the filename

Return a string to use as the new filename. Example:

value = com.toolsverse.util.FilenameUtils.getBaseName(fileName) + '_test' + '.csv';

Capture the filename for later use

Store the filename in a global variable, a flow variable, or flow storage. Example for a global variable:

com.toolsverse.config.SystemConfig.instance().getProperties()
    .put('file_to_read', fileName);

Expose a file as an API endpoint

Use the PULL pattern in Building custom APIs to expose a file or a dataset as a REST endpoint.

Related resources

Automatically encrypt files in server storage — PGP encryption for files created in ETL flows that land in server storage.
Create XML and JSON documents using JavaScript.
Filter and modify rows in a source CSV file.
Web / HTML scraping with JavaScript and Python.

Articles in this section

What can you do with files in Etlworks?

Connections for files, NoSQL, and email

Test a connection

Supported file formats

Browse files and view data in Etlworks Explorer

ETL with files

File operations

Upload local files

Use server storage as a staging area

Process files in an ETL flow using wildcard filenames

Setup

Handling processed files

Configure destination name (TO)

Process files in a specific order

Folders and relative paths

Connectors that support relative paths and folder auto-creation

Relative-path examples

Create Folder flow

Automatic folder creation

Append rows to an existing file

How it works

Step 1. Create a temporary database connection

Step 2. Create the flow

Step 3. Load the existing file into the staging table

Step 4. Load the new rows into the same staging table

Step 5. Write the combined data back to the file

Removing duplicate rows (optional)

Tip: keep all fields as strings during staging

Disable automatic field-data-type detection

Working with very large files

Read gzipped and zipped files in ETL transformations

Capture or modify the filename in a source-to-destination transformation

Variables available in the function

Modify the filename

Capture the filename for later use

Expose a file as an API endpoint

Related resources

Related articles