About nested datasets
A nested dataset is a semi-structured document in any supported data exchange format (such as XML, JSON, X12, HL7, etc.) where some elements of the data are nested inside other elements of the data.
The input and output of different APIs can range from simple and flat to highly nested and complex.
Example
Here is an example of a nested JSON returned by the web service:
[
{
"firstName":"Joe",
"lastName":"Doe",
"ids":[
{
"type":"ssn",
"id":"101111222"
}
],
"age":18,
"streetAddress":"100 Internet Dr",
"city":"Pittsburgh",
"state":"PA",
"postalCode":"12345",
"phoneNumbers":[
{
"type":"Mobile",
"phoneNumber":"111-111-1111"
},
{
"type":"Home",
"phoneNumber":"222-222-2222"
}
]
}
]
In relational databases and flat files (such as CSV, Excel, etc.), the data is represented as datasets with columns and rows:
EMPNO FIRSTNME MIDINIT LASTNAME WORKDEPT phoneNumbers
10 CHRISTINE I HAAS A00 3978
20 MICHAEL L THOMPSON B01 3476
30 SALLY A KWAN C01 4738
50 JOHN B GEYER E01 6789
60 IRVING F STERN D11 6423
70 EVA D PULASKI D21 7831
What if the phoneNumbers
is an array of phones?
"phoneNumbers":[
{
"type":"Mobile",
"phoneNumber":"111-111-1111"
},
{
"type":"Home",
"phoneNumber":"222-222-2222"
}
]
As you can see below, phoneNumbers
can be easily represented as a dataset:
type phoneNumber
mobile 111-111-1111
home 222-222-2222
Now, if we say that the phoneNumbers
column has an ARRAY
data type, and each value in a row is a dataset, we have found a way to encode nested data objects.
This is exactly how nested data objects are represented in Etlworks Integrator.
There is no limit to nestedness in Etlworks Integrator, so each column anywhere in the nested object can be a nested object itself.
Terminology
Dimension
: In Etlworks Integrator, the inner datasets (phoneNumbers
in the example above) of the nested dataset are also called dimensions.Path
: a comma-separated list of dimensions in the formatparent_of_parent,parent,child
that are used to specify the path to the data element in the nested dataset to extract data from.Driving dataset
: The outer dataset (the root nested object) is called a driving dataset.
Work with nested datasets
- Mapping when the Source is a nested dataset
- Use SQL to extract data from a nested dataset
- Create staging tables or flat files for each dimension of the nested dataset
- Flatten nested dataset using Extract Dimensions transformation
- Transform nested XML using XSLT
- Flatten source XML with repeating segments
- Create nested documents
Streaming is disabled when working with nested datasets
Etlworks Integrator can read and write nested data objects, which are encoded in all the supported data exchange Formats, such as JSON, XML, HL7, etc.
By default, Etlworks Integrator attempts to use streaming, which is another way of saying, "please keep as little data in memory as possible." Generally speaking, streaming is faster and has a very low memory footprint. Unfortunately, streaming is not possible when reading complex nested data objects, which are typically returned by web services, so Etlworks Integrator automatically disables it:
If the source is a nested dataset and the destination is a database, it is highly recommended to enable the Force Streaming option. When streaming is forced, the system automatically creates a stream-able view from a non-stream-able source. It still keeps all the records in memory but does not keep all the SQL statements in memory, and it is possible to use bind variables with a stream-able view.
Comments
0 comments
Please sign in to leave a comment.