What is nested dataset
A nested dataset is a semi-structured document in any supported data exchange format (such as XML, JSON, X12, HL7, etc.) where some elements of the data are nested inside other elements of the data.
Example
Here is an example of a nested JSON returned by the web service:
[
{
"firstName":"Joe",
"lastName":"Doe",
"ids":[
{
"type":"ssn",
"id":"101111222"
}
],
"age":18,
"streetAddress":"100 Internet Dr",
"city":"Pittsburgh",
"state":"PA",
"postalCode":"12345",
"phoneNumbers":[
{
"type":"Mobile",
"phoneNumber":"111-111-1111"
},
{
"type":"Home",
"phoneNumber":"222-222-2222"
}
]
}
]
In relational databases and flat files (such as CSV, Excel, etc.), the data is represented as datasets with columns and rows:
EMPNO FIRSTNME MIDINIT LASTNAME WORKDEPT phoneNumbers
10 CHRISTINE I HAAS A00 3978
20 MICHAEL L THOMPSON B01 3476
30 SALLY A KWAN C01 4738
50 JOHN B GEYER E01 6789
60 IRVING F STERN D11 6423
70 EVA D PULASKI D21 7831
What if the phoneNumbers
is an array of phones?
"phoneNumbers":[
{
"type":"Mobile",
"phoneNumber":"111-111-1111"
},
{
"type":"Home",
"phoneNumber":"222-222-2222"
}
]
As you can see below, phoneNumbers
can be easily represented as a dataset:
type phoneNumber
mobile 111-111-1111
home 222-222-2222
Now, if we say that the phoneNumbers
column has an ARRAY
data type, and each value in a row is a dataset, we have found a way to encode nested data objects.
This is precisely how nested data objects are represented in Etlworks.
There is no limit to nestedness in Etlworks, so each column anywhere in the nested object can be a nested object itself.
Terminology
Dimension
: In Etlworks, the inner datasets (phoneNumbers
in the example above) of the nested dataset are also called dimensions.Path
: a comma-separated list of dimensions in the formatparent_of_parent,parent,child
that are used to specify the path to the data element in the nested dataset to extract data from.Driving dataset
: The outer dataset (the root nested object) is called a driving dataset.
Data exchange formats that support nested documents
Work with nested datasets
Etlworks includes several techniques for reading and creating nested documents.
- Nested mapping
- Mapping when the source is a nested and the destination is flat
- Use SQL to extract data from a nested dataset
- Normalize nested dataset as flat tables or files
- Create nested documents using JavaScript
- Flatten nested dataset using Extract Dimensions transformation
- Transform nested XML using XSLT
- Flatten source XML with repeating segments
Streaming is disabled
Etlworks can read and write nested documents in JSON, XML, JSON, Parquet, Avro, EDI, and other supported formats.
By default, it attempts to use streaming to load data into the destination. Streaming is faster and has a very low memory footprint. Unfortunately, streaming cannot be used when reading and writing complex nested data objects, so Etlworks automatically disables it.
If the source is a nested dataset and the destination is a database, it is highly recommended to enable the Force Streaming option. When streaming is forced, the system automatically creates a stream-able view from a non-stream-able source. It still keeps all the records in memory but does not keep all the SQL statements in memory, and it is possible to use bind variables with a stream-able view.
Comments
0 comments
Please sign in to leave a comment.