Overview
One way to extract data from nested source objects is to normalize them into a relational data model.
For example, here is a nested source (JSON document):
[
    {
        "firstName": "Joe",
        "lastName": "Doe",
        ....
        "phones": [
            {
                "type": "mobile",
                "phoneNumber": "4124211111"
            }
        ],
        "tests": [
            {
                "who": "somebody",
                "why": "covid19"
            }
        ]
    }
]
If you were designing a database for the example above, you would typically create a main table for the top-level fields and a separate table for each nested array (phones and tests).
This is exactly what Etlworks can do for you automatically.
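To make the idea concrete, here is a minimal sketch of that normalization in plain Python. The table names (users, users_phones, users_tests) are illustrative, not Etlworks APIs, and the elided fields from the example are omitted:

```python
import json

# The nested source document from the example above (elided fields omitted).
doc = json.loads("""
[
  {
    "firstName": "Joe",
    "lastName": "Doe",
    "phones": [{"type": "mobile", "phoneNumber": "4124211111"}],
    "tests": [{"who": "somebody", "why": "covid19"}]
  }
]
""")

# One flat "table" (list of rows) per level of nesting.
users, users_phones, users_tests = [], [], []
for row in doc:
    users.append({"firstName": row["firstName"], "lastName": row["lastName"]})
    users_phones.extend(row["phones"])  # one row per nested phone object
    users_tests.extend(row["tests"])    # one row per nested test object

print(users)        # [{'firstName': 'Joe', 'lastName': 'Doe'}]
print(users_phones) # [{'type': 'mobile', 'phoneNumber': '4124211111'}]
```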
Process
Step 1. In Flows, click add new Flow and type in staging in the Select Flow Type box.
Step 2. Choose either Extract nested dataset and create staging files or Extract nested dataset and create staging tables, depending on what your destination is.
Step 3. Add a source-to-destination transformation where (left to right):
- CONNECTION: the source connection. It can be file storage, a cloud storage API, a NoSQL database, or a message queue.
- FORMAT: the data exchange format of the source dataset, for example, JSON.
- FROM: the source object name.
- CONNECTION: the destination connection. It can be a database, file storage, or cloud storage.
- FORMAT: the optional destination format (required if the destination is a file), for example, JSON or CSV.
- TO: the destination object name. This name will be used as a prefix for all created objects (tables or files).
Step 4. Add a per-field Mapping, if needed.
Step 5.1. If the destination is a database, you can specify the following parameters:
- Populate Staging Tables in Parallel: creating and populating staging tables in parallel (the default) is much faster but is performed out of order.
- Drop Staging Tables: controls whether the flow should drop staging tables. Possible values are:
  - true: always drop staging tables.
  - false: never drop staging tables.
  - first loop iteration: drop staging tables on the first iteration of the loop, or if the flow is not executed in a loop.
- Alter Staging Tables: if this option is enabled, the staging table already exists, and the list of columns has changed since the last run, the table will be automatically altered by adding new columns. Note that dropping, altering, and renaming columns is not supported.
- Tables to Exclude: by default, the flow will create and populate staging tables for all dimensions (inner objects). You can specify a comma-separated list of dimensions to exclude.
- Dimensions to convert to string: a comma-separated list of nested dimensions to convert to strings (stringify) in the original format (JSON, XML, etc.). To convert all nested dimensions to strings, enter all in this field.
- Do not create tables for dimensions converted to string: if this option is enabled, the flow will not create staging tables for dimensions (nested fields) converted to strings.
- Add Parent name to the staging table name: if this option is enabled, the flow will add the parent table name to the staging table name in the format parent_staging.
- Add Parent name as a column: if this option is enabled, the flow will add the parent node name as a column. The name of the column is going to be nestedparent, and the value will be set to the parent node name.
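The two parent-name options above can be sketched in a few lines of Python. The names parent_staging, users, and phones are illustrative assumptions, not values taken from the product:

```python
# Hypothetical sketch of "Add Parent name to the staging table name" and
# "Add Parent name as a column" for a parent node and one of its dimensions.
parent, dimension = "users", "phones"
rows = [{"type": "mobile", "phoneNumber": "4124211111"}]

# Option 1: prefix the staging table name with the parent name.
staging_table = f"{parent}_{dimension}"

# Option 2: add the parent node name to every row as the nestedparent column.
rows_with_parent = [dict(r, nestedparent=parent) for r in rows]

print(staging_table)  # users_phones
```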
Step 5.2. If the destination is a file, you can specify the following parameters:
- Create Staging Files in Parallel: creating staging files in parallel (the default) is much faster but is performed out of order.
- Files to Exclude: by default, the flow will create staging files for all dimensions (inner objects). You can specify a comma-separated list of dimensions to exclude.
- Dimensions to convert to string: a comma-separated list of nested dimensions to convert to strings (stringify) in the original format (JSON, XML, etc.). To convert all nested dimensions to strings, enter all in this field.
- Do not create tables for dimensions converted to string: if this option is enabled, the flow will not create staging files for dimensions (nested fields) converted to strings.
- Add Parent name to the staging file name: if this option is enabled, the flow will add the parent name to the staging file name in the format parent_staging.
How the flow creates staging tables and files
The flow uses the following rules when it creates and populates staging tables and files:
- The flow creates a main table/file with the same name as the destination (TO).
- The flow creates a separate table/file for each dimension (a nested node in the original document). The name is destination_dimensionName, for example, users_phoneNumbers. If Add Parent name is enabled, the flow will add the parent name to the staging table/file name in the format parentNodeName_dimensionName.
- If Drop staging tables is set to true, the flow always drops existing staging tables. Enable this option if the flow creates persisted staging tables (as opposed to temporary tables).
- If Drop staging tables is set to first loop iteration, the flow drops existing staging tables on the first iteration of the loop. Enable this option if the flow is executed in a loop and you want to preserve the data populated in each iteration of the loop.
- If Add Parent name as a column is enabled, the flow adds the field nestedparent to each row of each created table and file. It sets the value of this field to the parent node name.
- The flow does not create tables/files for the excluded dimensions.
- The flow sets the values of the nested fields to null unless they are included in Dimensions to convert to string.
- The flow uses the original field names unless they are changed in the MAPPING.
- The flow adds the field nestedrowid to each row of each created table and file. The value of this field is a UUID.
- The flow adds the field nestedparentrowid to each row of the dimension (inner dataset). The value of this field is the parent's nestedrowid.
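The row-id rules above can be sketched as follows. This is a minimal illustration of the nestedrowid/nestedparentrowid linkage, not the product's implementation; the normalize helper is hypothetical:

```python
import uuid

def normalize(records, dimension):
    """Split one nested dimension into a child table, linking rows by UUID."""
    parents, children = [], []
    for rec in records:
        rec = dict(rec)
        rec["nestedrowid"] = str(uuid.uuid4())  # every row gets a UUID
        for child in rec.pop(dimension, []):
            child = dict(child)
            child["nestedrowid"] = str(uuid.uuid4())
            # each child row carries its parent's nestedrowid
            child["nestedparentrowid"] = rec["nestedrowid"]
            children.append(child)
        rec[dimension] = None  # nested field is set to null in the main table
        parents.append(rec)
    return parents, children

users, phones = normalize(
    [{"firstName": "Joe", "phones": [{"type": "mobile"}]}], "phones")
```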
How to exclude dimensions (nodes)
By default, the flow will create and populate staging tables for all dimensions (inner nodes). You can set a comma-separated list of dimensions to exclude in Tables to Exclude. For example, let's assume that there is a nested JSON dataset:
[
    {
        "firstName": "Joe",
        "lastName": "Doe",
        ....
        "phones": [
            {
                "type": "mobile",
                "phoneNumber": "4124211111"
            }
        ],
        "addresses": [
            {
                "type": "home",
                "address": "123 Main str, Pittsburgh PA, 15217"
            }
        ],
        "contacts": [
            {
                "who": "somebody",
                "why": "covid19"
            }
        ]
    }
]
If you want to exclude the nodes phones and addresses, simply add them to Tables to Exclude as phones,addresses.
A node with the same name can belong to multiple parent nodes. Adding a node to the exclusion list will exclude it from all parent nodes.
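A short sketch of the exclusion rule, assuming a dimension is any field whose value is a list (the record and field names come from the example above):

```python
# Hypothetical sketch: skip dimensions listed in "Tables to Exclude".
tables_to_exclude = {"phones", "addresses"}

def dimensions(record):
    # treat any field holding a list as a dimension (inner node)
    return [k for k, v in record.items() if isinstance(v, list)]

record = {"firstName": "Joe",
          "phones": [{"type": "mobile"}],
          "addresses": [{"type": "home"}],
          "contacts": [{"who": "somebody"}]}

# only non-excluded dimensions get staging tables
staged = [d for d in dimensions(record) if d not in tables_to_exclude]
print(staged)  # ['contacts']
```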
How to preserve the original value of the nested field
By default, the Flow converts all nested dimensions into flat tables or files. It is possible to convert nested fields into stringified JSONs or XMLs, hence preserving the original values.
For example, consider the response from the Magento /orders API endpoint, where the nested dimensions are items, billing_address, payment, and extension attributes.
By default, the Flow creates an order table, converts all nested dimensions to staging tables with the same names, and sets the values of the fields items, billing_address, payment, and extension attributes for each order to null.
To preserve the original JSON values of the fields items, billing_address, payment, and extension attributes, add them to Dimensions to convert to string.
It is recommended to enable Do not create tables for dimensions converted to string to direct the Flow not to create staging tables for these nested fields.
How to handle a variable number of columns in the source
It is quite common for the source to contain a variable number of columns in each row, for example, when some rows omit optional fields.
To create a table that contains a superset of all columns, simply enable the option Alter Staging Tables.
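The effect of Alter Staging Tables can be sketched as follows: as new columns appear in later rows, the table grows to the superset, and rows that lack a column get null (None here). The sample rows are illustrative assumptions:

```python
# Two source rows that disagree on their columns.
rows = [{"id": 1, "name": "Joe"},
        {"id": 2, "email": "j@example.com"}]

# Collect the superset of columns in first-seen order, the way
# ALTER TABLE ... ADD COLUMN would grow the staging table.
columns = []
for row in rows:
    for col in row:
        if col not in columns:
            columns.append(col)

# Missing values become None (null in the database).
normalized = [{c: row.get(c) for c in columns} for row in rows]
print(columns)  # ['id', 'name', 'email']
```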