Etlworks reads, writes, transforms, and moves files in any of the supported data exchange formats — CSV, JSON, XML, Excel, Avro, Parquet, EDI, HL7, and more. Files can live on cloud storage, server storage, FTP/FTPS/SFTP, WebDAV, SMB, Box, Dropbox, Google Drive, OneDrive, SharePoint, NoSQL stores like Redis or MongoDB, or in an email server.
What can you do with files in Etlworks?
| Use case | How |
|---|---|
| ETL with files | Files as a source or destination in a source-to-destination transformation. |
| File management | Copy, move, rename, delete, zip, unzip, create folders — see File operations. |
| Split or merge files | Split a large file into smaller chunks, or merge multiple files into one. |
| Transform XML with XSLT | Apply an XSL stylesheet to one or more XML files. |
| Process EDI / HL7 messages | HL7: all common formats and protocols. EDI: X12, EDIFACT, and others. |
| Expose a dataset as a REST endpoint | Building custom APIs — the PULL pattern returns a file's content (or a dataset) as an HTTP response. |
Connections for files, NoSQL, and email
Pick the connector that matches where your files live. All of these support file-based and ETL flows.
- Cloud storage: Amazon S3, Google Cloud Storage, Microsoft Azure Storage
- Server / local storage: Server Storage
- FTP family: FTP, FTPS, SFTP
- SaaS storage: Box, Dropbox, Google Drive, OneDrive for Business, SharePoint
- Network protocols: WebDAV, SMB Share
- Key-value / NoSQL: Redis, MongoDB
- Email: Inbound email, Outbound email
Test a connection
Click Test Connection on the connection screen. Etlworks connects to the storage and attempts to list files using the configured folder and filename. Wildcard filenames are supported.
Supported file formats
- CSV / Fixed Length Text
- JSON / JSON dataset
- XML / XML dataset
- Excel
- HL7 2.x / HL7 FHIR
- X12 and EDIFACT
- Key=Value
- Avro / Parquet
- CLOB
- PDF / HTML
Browse files and view data in Etlworks Explorer
- Create a file connection.
- Create a format matching the file.
- Open Etlworks Explorer, select the connection, and link the format.
- Browse files and fields as metadata, view rows in a grid, run SQL-like queries, inspect dependencies.
ETL with files
File operations
- Copy, move, rename, delete, zip, unzip files; create folders
- Transform XML files using XSLT
- Split files
- Merge files
- File operations — supported operations, auto-decompress, auto-archive
- Destination file name in file operations
Upload local files
To upload a file from your machine into Etlworks, create a connection and use Etlworks Explorer to upload. Supported connection types for local-file upload:
Amazon S3, Google Cloud Storage, Microsoft Azure Storage, Server Storage, FTP, FTPS, SFTP, WebDAV, HTTP, Box, Dropbox, Google Drive, OneDrive for Business, SharePoint.
- Create the connection.
- Open Etlworks Explorer and upload the file.
- Create a flow that uses the connection as a source.
Use server storage as a staging area
Every Etlworks account gets its own Home folder on Etlworks servers. Most of your files are probably in cloud or remote storage (S3, FTP, SFTP, …), but staging copies locally first can be faster for some workloads. After processing, you can delete the staged copies.
- Create a server storage connection.
- Copy or move files from cloud or remote storage to the server storage.
- Process files from server storage.
- Delete the staged files.
Process files in an ETL flow using wildcard filenames
Etlworks can process multiple files in a folder using a wildcard pattern. This section covers the wildcard option in ETL transformations where the source is one or more files.
For related patterns, see:
- Process files using file operations (copy / move / etc.)
- File loop
- Per-warehouse bulk-load flows (Snowflake, Redshift, BigQuery, …) — see Working with cloud data warehouses.
Setup
- Create a source connection.
- Create a destination connection.
- Create a flow that reads a file and loads it into the destination.
- In the source-to-destination transformation, enter a wildcard filename (e.g. *.csv) in the FROM field.
-
Optional: enter a comma-separated list of files to exclude or to include.
Handling processed files
By default, processed files remain in the source folder. Three post-processing options are available, applied in priority order if more than one is enabled:
-
Delete loaded source files (highest priority) — removes the file after a successful load. Optionally enable Delete source files on error to remove files if the load fails too. Under MAPPING → Parameters.
-
Skip previously processed files — keeps files in place but tracks which have been processed, so future runs skip them. Configure under MAPPING → Parameters:
- Skip Previously Processed Files — only new or modified files are processed.
- File Retention in Cache (ms) — how long processed-file metadata is cached.
- Custom Cache File Name — override the default cache filename.
-
Move processed files to (lowest priority) — moves the file to another folder after a successful load. Set under Flow → Connections tab. Ignored if either of the two options above is enabled.
Configure destination name (TO)
Default behavior. The TO field can include *. The flow replaces * with the source filename (without extension).
Example. FROM = *.json, TO = public.*. With files test1.json, test2.json, test3.json, the flow creates / updates tables public.test1, public.test2, public.test3.
Programmatic destination naming. Use a regular expression or JavaScript in Calculate Destination Object Name under MAPPING → Parameters. The flow auto-detects whether the input is JavaScript or a regex.
Regex example. Extract the part of the filename between - and _cdc_stream:
- Source (FROM): *-*_cdc_stream_*.csv
- Destination (TO): *
- Calculate Destination Object Name: -(.*?)_cdc_stream
JavaScript example. The last evaluated expression replaces * in TO:
var start = name.indexOf('-');
var end = name.indexOf('_cdc_stream');
value = name.substring(start + 1, end);
Process files in a specific order
When wildcards match multiple files, the source connection's File processing order setting decides which file comes first.
| Option | What it does |
|---|---|
| Disabled | Connection default — typically filename ascending. |
| oldest / newest | Sorted by modification time, oldest or newest first. |
| ascending / descending | Sorted by filename, A→Z or Z→A. |
| largest / smallest | Sorted by file size, largest or smallest first. |
Folders and relative paths
Every file-based and cloud-storage connector lets you configure a Directory. The Directory is the root folder for all ETL and file-based operations using that connection. Relative paths in flow FROM / TO fields are resolved relative to it.
The Directory can include multiple subfolders, e.g. demo/data or archive/2025/reports. If empty, the storage system's default root is used.
Example. Directory = demo/data. FROM = /sales/*.csv resolves to /demo/data/sales/*.csv.
Connectors that support relative paths and folder auto-creation
| Connector | ETL FROM/TO | File-based FROM/TO | Auto-create folders |
|---|---|---|---|
| Amazon S3 | Since 7.x | Since 7.x | N/A |
| Google Cloud Storage | Since 8.1.5 | Since 8.1.5 | N/A |
| Microsoft Azure Storage | Since 7.x | Since 7.x | N/A |
| Server Storage | Since 8.1.5 | Since 7.x | Since 8.1.5 |
| FTP | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
| FTPS | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
| SFTP | Since 7.x | Since 7.x | Since 8.1.5 |
| Box | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
| Dropbox | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
| Google Drive | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
| OneDrive for Business | Since 7.x | Since 7.x | Since 8.1.5 |
| SharePoint | Since 7.x | Since 7.x | Since 8.1.5 |
| WebDAV | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
| SMB Share | Since 8.1.5 | Since 8.1.5 | Since 8.1.5 |
Relative-path examples
All examples use a connection whose Directory is /opt/data.
ETL flow:
- FROM = /03012020/*.csv, TO = /archive/03012020/ — reads from /opt/data/03012020 and writes to /opt/data/archive/03012020.
- FROM = /01012020/data.csv, TO = /processed/data.csv — reads /opt/data/01012020/data.csv, writes /opt/data/processed/data.csv.
File-based flow:
- Copy Files: FROM = /02012020/*.csv, TO = /backup/02012020/* — copies /opt/data/02012020/*.csv to /opt/data/backup/02012020.
- Move Files: FROM = /03012020/*.csv, TO = /archive/03012020/*.
- Delete Files: FROM = /01012020/*.csv.
- Create Folder: FROM = /new/data/2025/ — creates the full path under the Directory.
Create Folder flow
The Create Folder flow creates multiple nested folders in one operation. Running it with path /test/data creates every missing folder under the base directory.
Automatic folder creation
All file-based and cloud-storage connectors include an Automatically Create Missing Folders option. When on, the connector creates any missing folders during ETL or file-based operations (Copy Files and Move Files). Off by default. Enable it when your flow needs to create new subfolders on the fly.
Append rows to an existing file
By default, Etlworks overwrites the destination file every time a flow writes to it. To append new rows to an existing file, use a temporary database as a staging area.
How it works
- Load the existing file into a temporary table.
- Load the new rows into the same temporary table.
- Write the combined contents back to the destination file.
Step 1. Create a temporary database connection
Create a temporary (staging) database connection. Recommended settings:
- Always enclose table/column names in double quotes — tolerates non-standard column names.
- Varchar Field Size Multiplier = 1000 — prevents errors on very long string columns.
- All columns are varchars — keeps data as-is with no type conversion.
Step 2. Create the flow
Open Flows, click Add flow, and pick Any to any ETL.
Step 3. Load the existing file into the staging table
Add the first source-to-destination transformation:
- Source connection — connection for the existing destination file.
- Source format — format matching the existing file.
- FROM — the existing file (the one you want to append to).
- Destination connection — the temporary database from Step 1.
- TO — a staging table name (e.g. staging). The table is created when the flow runs and dropped when it finishes.
Go to Configure → Source is a File and enable Ignore when there is no file so the flow still runs the first time, when the destination file doesn't exist yet.
Step 4. Load the new rows into the same staging table
Add a second source-to-destination transformation:
- Source connection — connection for the new file(s).
- Source format — same format as the destination file.
- FROM — the new file or wildcard (e.g. *.csv).
- Destination connection — the temporary database.
- TO — the same staging table name as in Step 3.
Step 5. Write the combined data back to the file
Add a final source-to-destination transformation:
- Source connection — the temporary database.
- FROM — the staging table name.
- Destination connection — the original destination file connection.
- Destination format — the original file format.
- TO — the original destination filename.
Removing duplicate rows (optional)
If appending might produce duplicates, filter them with a Source Query on the final transformation. Use SQL like SELECT DISTINCT * FROM staging or a more specific window-function query depending on your data.
Tip: keep all fields as strings during staging
Enable All fields are strings in both source formats (Steps 3 and 4) to prevent automatic type conversion during extract. Values are preserved exactly as they appear in the file.
Disable automatic field-data-type detection
Etlworks detects column data types from the first few rows of a CSV. This usually works, but in edge cases it picks the wrong type:
id, first_name, last_name 1, Joe, Doe 2, Jane, Doe AB, Sima, The Dog
From the first two rows, the system infers id as NUMERIC. The third row's AB then fails the database load.
Two fixes:
- Manually set the column data type in mapping — precise but per-column.
- Enable All fields are strings on the source format — all columns become VARCHAR, length set to the database's maximum. Useful for staging tables.
Working with very large files
For very large files, three techniques help:
- Split files into smaller chunks. Splitting uses minimal RAM and CPU.
- Performance tips when working with files and web services.
- Merge files on the way out if you need a single output.
Read gzipped and zipped files in ETL transformations
Etlworks reads .gz and .zip files directly when the file is a source in an ETL transformation — no separate extract step. This is called inline unzip.
Requirements:
- The file has a .gz or .zip extension.
- The archive contains a single file (multi-file ZIPs aren't supported in this path).
Set FROM to the .gz or .zip path. Wildcard filenames are supported.
Inline unzip also works in Etlworks Explorer.
Capture or modify the filename in a source-to-destination transformation
Almost every format in Etlworks (CSV, JSON, …) can be configured with a JavaScript function to capture or modify the source / destination filename. The function runs only when the format is used in a source-to-destination ETL transformation.
Variables available in the function
- fileName — filename calculated by the connector.
- isReading — true when running for the source, false for the destination.
- batchNumber — integer batch number when the connector splits the file, otherwise -1.
Modify the filename
Return a string to use as the new filename. Example:
value = com.toolsverse.util.FilenameUtils.getBaseName(fileName) + '_test' + '.csv';
Capture the filename for later use
Store the filename in a global variable, a flow variable, or flow storage. Example for a global variable:
com.toolsverse.config.SystemConfig.instance().getProperties()
.put('file_to_read', fileName);
Expose a file as an API endpoint
Use the PULL pattern in Building custom APIs to expose a file or a dataset as a REST endpoint.
Related resources
- Automatically encrypt files in server storage — PGP encryption for files created in ETL flows that land in server storage.
- Create XML and JSON documents using JavaScript.
- Filter and modify rows in a source CSV file.
- Web / HTML scraping with JavaScript and Python.