Output and data-quality transformations – Etlworks Support

These output-side and data-quality operations control how data lands at the destination, filter out duplicates, validate rows, or keep the dataset in memory for downstream steps. Configure each one under Transformation → MAPPING.

Partition-by

Split a large dataset into smaller output files. Available only when the destination is a file.

Two modes, picked by the value type in the Partition By field:

Partition by record count. Enter a numeric value — the maximum number of records per file. With Partition By = 100 and 1,000 source records, you get 10 files of 100 records each. Filenames: original_name + _ + index + original_extension.

Partition by field values. Enter a comma-separated list of column names. Etlworks creates one file per unique combination of those columns. With Partition By = last_name,first_name, every unique (last, first) pair gets its own file. Filenames: original_name + _ + partition_value + original_extension.

Ignore the original filename. The default filename pattern includes the source filename, e.g. order_1234.csv. To get clean per-partition filenames without that prefix, enable MAPPING → Complex Transformations → Ignore Original File Name.

Configure: MAPPING → Complex Transformations → Partition.
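The two partitioning modes can be pictured with a short Python sketch. This is not Etlworks code; it only mimics the described behavior. The function names, the 1-based file index, and the use of "_" to join multiple partition values are assumptions made for illustration.

```python
from collections import defaultdict
from pathlib import PurePath


def partition_by_count(records, max_records, filename):
    """Split records into chunks of at most max_records, one chunk per file."""
    path = PurePath(filename)
    files = {}
    for start in range(0, len(records), max_records):
        # Assumption: the index in the filename starts at 1.
        index = start // max_records + 1
        name = f"{path.stem}_{index}{path.suffix}"
        files[name] = records[start:start + max_records]
    return files


def partition_by_fields(records, fields, filename):
    """Group records into one file per unique combination of field values."""
    path = PurePath(filename)
    files = defaultdict(list)
    for record in records:
        # Assumption: multiple partition values are joined with "_".
        value = "_".join(str(record[field]) for field in fields)
        files[f"{path.stem}_{value}{path.suffix}"].append(record)
    return dict(files)


# 1,000 records with a limit of 100 per file yields 10 files.
orders = [{"id": i} for i in range(1000)]
by_count = partition_by_count(orders, 100, "orders.csv")

# Each unique (last_name, first_name) pair gets its own file.
people = [
    {"last_name": "Smith", "first_name": "Ann"},
    {"last_name": "Smith", "first_name": "Bob"},
    {"last_name": "Smith", "first_name": "Ann"},
]
by_fields = partition_by_fields(people, ["last_name", "first_name"], "people.csv")
```

In this sketch the mode is chosen by calling a different function; in Etlworks the same choice is made by entering either a number or a list of column names in the Partition By field.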

Remove duplicates

Drop subsequent records that match an earlier record on a configurable set of fields. Etlworks compares each incoming row to what it has already processed and ignores matches.

Configure: MAPPING → Complex Transformations → Remove Duplicates. Enter the comma-separated list of fields that define a duplicate.
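The first-record-wins behavior described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Etlworks implementation: remove_duplicates is a hypothetical helper, and field values are assumed to be hashable.

```python
def remove_duplicates(records, key_fields):
    """Keep the first record for each combination of key field values."""
    seen = set()
    unique = []
    for record in records:
        # The key is built only from the fields that define a duplicate.
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            continue  # a matching record was already processed, so skip it
        seen.add(key)
        unique.append(record)
    return unique


rows = [
    {"email": "ann@example.com", "plan": "free"},
    {"email": "bob@example.com", "plan": "pro"},
    {"email": "ann@example.com", "plan": "pro"},
]
deduplicated = remove_duplicates(rows, ["email"])
```

Only the listed fields take part in the comparison, so the third row is dropped even though its plan differs from the first.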

Validation

Define rules that reject a row, reject the entire dataset, or halt the flow when source data fails a check, e.g., a missing required field, a value out of range, or a type mismatch.

Configure: MAPPING → Additional Transformations → Validation. Specify the rule set and the failure action (reject row / reject dataset / halt flow).
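The three failure actions can be illustrated with a minimal Python sketch. Everything in it (the Rule class, the FlowHalted exception, the validate function, and the order in which rules are evaluated) is hypothetical and does not reflect how rules are written in Etlworks.

```python
from dataclasses import dataclass
from typing import Callable

REJECT_ROW = "reject row"
REJECT_DATASET = "reject dataset"
HALT_FLOW = "halt flow"


class FlowHalted(Exception):
    """Raised when a rule with the 'halt flow' action fails."""


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the row passes
    on_failure: str


def validate(records, rules):
    """Return the rows that pass, applying each rule's failure action."""
    accepted = []
    for record in records:
        keep = True
        for rule in rules:
            if rule.check(record):
                continue
            if rule.on_failure == HALT_FLOW:
                raise FlowHalted(f"rule failed: {rule.name}")
            if rule.on_failure == REJECT_DATASET:
                return []  # one bad row discards the whole dataset
            keep = False  # reject row: drop this record only
            break
        if keep:
            accepted.append(record)
    return accepted


rules = [
    Rule("email is required", lambda r: bool(r.get("email")), REJECT_ROW),
    Rule("amount in range", lambda r: 0 <= r["amount"] <= 1000, REJECT_DATASET),
]
rows = [
    {"email": "ann@example.com", "amount": 10},
    {"email": "", "amount": 5},
]
valid_rows = validate(rows, rules)
```

Here the second row is dropped because its email is empty, while a row with an out-of-range amount would cause the whole dataset to be rejected.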

Memory Connection as destination

A Memory Connection holds an entire dataset in RAM during flow execution. Almost any source-to-destination transformation can use a Memory Connection as its destination — the transformation extracts from the source and stores the result in memory, where other transformations in the same flow (or nested flows) can read it.

When to use it:

Lookups and enrichment against a small or medium reference dataset.
Caching reference data once and reusing it across multiple transformations.
Parsing a web service response before writing it elsewhere.
Calculations spanning multiple transformations.
Passing datasets between a parent flow and nested flows.
Avoiding temporary staging tables for short-lived data.

Constraint. The dataset must fit in RAM. Use staging tables instead for larger datasets.

Setup:

1. Create a connection of type Memory Connection.
2. Create or open a flow.
3. Add a source-to-destination transformation. Source = any supported source (database, file, API, …). Destination = the Memory Connection from step 1.
4. Name the transformation: the name is how downstream transformations reference the dataset.
5. Downstream transformations in the same flow can use the Memory Connection as their source.
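Conceptually, the pattern above resembles a named in-memory store: one step writes a dataset under a name, and later steps read it by that name. The Python sketch below illustrates the idea with a lookup and enrichment case. MemoryStore and its put and get methods are hypothetical and are not part of Etlworks.

```python
class MemoryStore:
    """Holds whole datasets in RAM, keyed by the name of the step that wrote them."""

    def __init__(self):
        self._datasets = {}

    def put(self, name, records):
        self._datasets[name] = list(records)

    def get(self, name):
        return self._datasets[name]


store = MemoryStore()

# First step: load a small reference dataset into memory once.
store.put("countries", [
    {"code": "US", "name": "United States"},
    {"code": "DE", "name": "Germany"},
])

# Downstream step: read the cached dataset by name and use it as a lookup.
countries_by_code = {row["code"]: row["name"] for row in store.get("countries")}
orders = [
    {"order_id": 1, "country": "US"},
    {"order_id": 2, "country": "FR"},
]
enriched = [
    {**order, "country_name": countries_by_code.get(order["country"])}
    for order in orders
]
```

Because the whole dataset lives in a dictionary in RAM, this approach only suits datasets that fit in memory, which mirrors the constraint stated above.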
