Overview
One way to extract data from nested source objects is to normalize them into a relational data model.
For example, here is a nested source (JSON document):
[
    {
        "firstName": "Joe",
        "lastName": "Doe",
        ....
        "phones": [
            {
                "type": "mobile",
                "phoneNumber": "4124211111"
            }
        ],
        "tests": [
            {
                "who": "somebody",
                "why": "covid19"
            }
        ]
    }
]
If you were designing a database for the example above, you would typically create a main table for the top-level fields and a separate table for each nested array (phones and tests).
This is exactly what Etlworks can do for you automatically.
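To make the idea concrete, here is a minimal sketch of that normalization in plain Python. The table names (users, users_phones, users_tests) are illustrative, not Etlworks APIs, and the elided fields from the example are omitted:

```python
import json

# The nested source document from the example above (elided fields omitted).
doc = json.loads("""
[
  {
    "firstName": "Joe",
    "lastName": "Doe",
    "phones": [{"type": "mobile", "phoneNumber": "4124211111"}],
    "tests": [{"who": "somebody", "why": "covid19"}]
  }
]
""")

# One flat "table" (list of rows) per level of nesting.
users, users_phones, users_tests = [], [], []
for row in doc:
    users.append({"firstName": row["firstName"], "lastName": row["lastName"]})
    users_phones.extend(row["phones"])  # one row per nested phone object
    users_tests.extend(row["tests"])    # one row per nested test object

print(users)        # [{'firstName': 'Joe', 'lastName': 'Doe'}]
print(users_phones) # [{'type': 'mobile', 'phoneNumber': '4124211111'}]
```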
Process
Step 1. In Flows, click add new Flow and type in staging in the Select Flow Type box.
Step 2. Choose either Extract nested dataset and create staging files or Extract nested dataset and create staging tables, depending on what your destination is.
Step 3. Add a source-to-destination transformation where (left to right):
- CONNECTION: the source connection. It can be file storage, a cloud storage API, a NoSQL database, or a message queue.
- FORMAT: the data exchange format of the source dataset, for example, JSON.
- FROM: the source object name.
- CONNECTION: the destination connection. It can be a database, file storage, or cloud storage.
- FORMAT: the optional destination format (required if the destination is a file), for example, JSON or CSV.
- TO: the destination object name. This name will be used as a prefix for all created objects (tables or files).
Step 4. Add a per-field Mapping, if needed.
Step 5.1. If the destination is a database, you can specify the following parameters:
- Populate Staging Tables in Parallel: creating and populating staging tables in parallel (the default) is much faster but is performed out of order.
- Drop Staging Tables: controls whether the flow should drop staging tables. Possible values are:
  - true: always drop staging tables.
  - false: never drop staging tables.
  - first loop iteration: drop staging tables on the first iteration of the loop, or if the flow is not executed in a loop.
- Alter Staging Tables: if this option is enabled, the staging table already exists, and the list of columns has changed since the last run, the table will be automatically altered by adding new columns. Note that dropping, altering, and renaming columns is not supported.
- Tables to Exclude: by default, the flow will create and populate staging tables for all dimensions (inner objects). You can specify a comma-separated list of dimensions to exclude.
- Dimensions to convert to string: a comma-separated list of nested dimensions to convert to strings (stringify) in the original format (JSON, XML, etc.). To convert all nested dimensions to strings, enter all in this field.
- Do not create tables for dimensions converted to string: if this option is enabled, the flow will not create staging tables for dimensions (nested fields) converted to strings.
- Add Parent name to the staging table name: if this option is enabled, the flow will add the parent table name to the staging table name in the format parent_staging.
- Add Parent name as a column: if this option is enabled, the flow will add the parent node name as a column. The name of the column is going to be nestedparent, and the value will be set to the parent node name.
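The two parent-name options above can be sketched in a few lines of Python. The names parent_staging, users, and phones are illustrative assumptions, not values taken from the product:

```python
# Hypothetical sketch of "Add Parent name to the staging table name" and
# "Add Parent name as a column" for a parent node and one of its dimensions.
parent, dimension = "users", "phones"
rows = [{"type": "mobile", "phoneNumber": "4124211111"}]

# Option 1: prefix the staging table name with the parent name.
staging_table = f"{parent}_{dimension}"

# Option 2: add the parent node name to every row as the nestedparent column.
rows_with_parent = [dict(r, nestedparent=parent) for r in rows]

print(staging_table)  # users_phones
```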
Step 5.2. If the destination is a file, you can specify the following parameters:
- Create Staging Files in Parallel: creating staging files in parallel (the default) is much faster but is performed out of order.
- Files to Exclude: by default, the flow will create staging files for all dimensions (inner objects). You can specify a comma-separated list of dimensions to exclude.
- Dimensions to convert to string: a comma-separated list of nested dimensions to convert to strings (stringify) in the original format (JSON, XML, etc.). To convert all nested dimensions to strings, enter all in this field.
- Do not create tables for dimensions converted to string: if this option is enabled, the flow will not create staging files for dimensions (nested fields) converted to strings.
- Add Parent name to the staging file name: if this option is enabled, the flow will add the parent name to the staging file name in the format parent_staging.
How the flow creates staging tables and files
The flow uses the following rules when it creates and populates staging tables and files:
- The flow creates a main table/file with the same name as the destination (TO).
- The flow creates a separate table/file for each dimension (a nested node in the original document). The name is destination_dimensionName, for example, users_phoneNumbers. If Add Parent name is enabled, the flow will add the parent name to the staging table/file name in the format parentNodeName_dimensionName.
- If Drop staging tables is set to true, the flow always drops existing staging tables. Enable this option if the flow creates persisted staging tables (as opposed to temporary tables).
- If Drop staging tables is set to first loop iteration, the flow drops existing staging tables on the first iteration of the loop. Enable this option if the flow is executed in a loop and you want to preserve the data populated in each iteration of the loop.
- If Add Parent name as a column is enabled, the flow adds the field nestedparent to each row of each created table and file. It sets the value of this field to the parent node name.
- The flow does not create tables/files for the excluded dimensions.
- The flow sets the values of the nested fields to null unless they are included in Dimensions to convert to string.
- The flow uses the original field names unless they are changed in the MAPPING.
- The flow adds the field nestedrowid to each row of each created table and file. The value of this field is a UUID.
- The flow adds the field nestedparentrowid to each row of the dimension (inner dataset). The value of this field is the parent's nestedrowid.
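The row-id rules above can be sketched as follows. This is a minimal illustration of the nestedrowid/nestedparentrowid linkage, not the product's implementation; the normalize helper is hypothetical:

```python
import uuid

def normalize(records, dimension):
    """Split one nested dimension into a child table, linking rows by UUID."""
    parents, children = [], []
    for rec in records:
        rec = dict(rec)
        rec["nestedrowid"] = str(uuid.uuid4())  # every row gets a UUID
        for child in rec.pop(dimension, []):
            child = dict(child)
            child["nestedrowid"] = str(uuid.uuid4())
            # each child row carries its parent's nestedrowid
            child["nestedparentrowid"] = rec["nestedrowid"]
            children.append(child)
        rec[dimension] = None  # nested field is set to null in the main table
        parents.append(rec)
    return parents, children

users, phones = normalize(
    [{"firstName": "Joe", "phones": [{"type": "mobile"}]}], "phones")
```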
How to exclude dimensions (nodes)
By default, the flow will create and populate staging tables for all dimensions (inner nodes). You can set a comma-separated list of dimensions to exclude in Tables to Exclude. For example, let's assume that there is a nested JSON dataset:
[
    {
        "firstName": "Joe",
        "lastName": "Doe",
        ....
        "phones": [
            {
                "type": "mobile",
                "phoneNumber": "4124211111"
            }
        ],
        "addresses": [
            {
                "type": "home",
                "address": "123 Main str, Pittsburgh PA, 15217"
            }
        ],
        "contacts": [
            {
                "who": "somebody",
                "why": "covid19"
            }
        ]
    }
]
If you want to exclude the nodes phones and addresses, simply add them to Tables to Exclude as phones,addresses.
A node with the same name can belong to multiple parent nodes. Adding a node to the exclusion list will exclude it from all parent nodes.
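A short sketch of the exclusion rule, assuming a dimension is any field whose value is a list (the record and field names come from the example above):

```python
# Hypothetical sketch: skip dimensions listed in "Tables to Exclude".
tables_to_exclude = {"phones", "addresses"}

def dimensions(record):
    # treat any field holding a list as a dimension (inner node)
    return [k for k, v in record.items() if isinstance(v, list)]

record = {"firstName": "Joe",
          "phones": [{"type": "mobile"}],
          "addresses": [{"type": "home"}],
          "contacts": [{"who": "somebody"}]}

# only non-excluded dimensions get staging tables
staged = [d for d in dimensions(record) if d not in tables_to_exclude]
print(staged)  # ['contacts']
```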
How to preserve the original value of the nested field
By default, the Flow converts all nested dimensions into flat tables or files. It is possible to convert nested fields into stringified JSONs or XMLs, hence preserving the original values.
For example, consider the response from the Magento /orders API endpoint, where the nested dimensions are items, billing_address, payment, and extension attributes.
By default, the Flow creates an order table, converts all nested dimensions to staging tables with the same names, and sets the values of the fields items, billing_address, payment, and extension attributes for each order to null.
To preserve the original JSON values of the fields items, billing_address, payment, and extension attributes, add them to Dimensions to convert to string.
It is recommended to enable Do not create tables for dimensions converted to string to direct the Flow not to create staging tables for these nested fields.
How to handle a variable number of columns in the source
It is quite common for the source to contain a variable number of columns in each row, for example, when some rows omit optional fields.
To create a table that contains a superset of all columns, simply enable the option Alter Staging Tables.
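The effect of Alter Staging Tables can be sketched as follows: as new columns appear in later rows, the table grows to the superset, and rows that lack a column get null (None here). The sample rows are illustrative assumptions:

```python
# Two source rows that disagree on their columns.
rows = [{"id": 1, "name": "Joe"},
        {"id": 2, "email": "j@example.com"}]

# Collect the superset of columns in first-seen order, the way
# ALTER TABLE ... ADD COLUMN would grow the staging table.
columns = []
for row in rows:
    for col in row:
        if col not in columns:
            columns.append(col)

# Missing values become None (null in the database).
normalized = [{c: row.get(c) for c in columns} for row in rows]
print(columns)  # ['id', 'name', 'email']
```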