Overview
If you want to keep all the data from the nested source object, including the data in all dimensions, you need to normalize the nested data object and create a relational data model.
For example, here is a nested source (JSON document):
[
{
"firstName": "Joe",
"lastName": "Doe",
....
"phones": [
{
"type": "mobile",
"phoneNumber": "4124211111"
}
],
"tests": [
{
"who": "somebody",
"why": "covid19"
}
]
}
]
If you were designing a database for the user's example above, you would typically be creating the following tables:
This is exactly what the Etlworks Integrator can do for you automatically.
Process
Step 1. Start by adding a new Flow and typing staging
into the Select Flow Type
box.
Step 2. Choose either Extract nested dataset and create staging files
or Extract nested dataset and create staging tables
, depending on what your destination is.
Step 3. Add a source to destination transformation where (left-to-right):
CONNECTION
: the source connection. It can be file storage, cloud storage API, NoSQL database, and message queue.FORMAT
: the data exchange format of the source dataset, for example, JSON.FROM
: the source object name.CONNECTION
: the destination connection. It can be a database, file storage, or cloud storage.FORMAT
: optional destination format (required if the destination is a file), for example, JSON or CSV.TO
: the destination object name. This name will be used as a prefix for all created objects (tables or files).
Step 4. Add a per-field Mapping, if needed.
Step 5.1. If the destination is a database, you can specify the following parameters:
Populate Staging Tables in Parallel
: creating and populating staging tables in parallel (default) is much faster but performed out of order.Drop Staging Tables
: if this option is enabled (the default), the Etlworks Integrator will drop any existing staging tables.Alter Staging Tables
: If this option is enabled, the staging table already exists and the list of columns has changed since the last run the table will be automatically altered by adding new columns. Note that dropping, altering, and renaming columns is not supported.Tables to Exclude
: by default, the Etlworks Integrator will create and populate staging tables for all dimensions (inner objects). You can specify a comma-separated list of dimensions to exclude.Dimensions to convert to string
: a comma-separated list of nested dimensions to convert to stings (stringify) in the original format (JSON, XML, etc.). To convert all nested dimensions to strings enterall
in this field.Do not create tables for dimensions converted to string
: if this option is enabled the flow will not create staging tables for dimensions (nested fields) converted to strings.Add Parent name to the staging table name
: this option is enabled the flow will add the parent table name to the staging table name in the formatparent_staging
.Add Parent name as a column
: if this option is enabled the flow will add the parent node name as a column. The name of the column is going to benestedparent
and the value will be set to the parent node name.
Step 5.2. If the destination is a file, you can specify the following parameters:
Create Staging Files in Parallel
: creating staging files in parallel (the default) is much faster but performed out-of-order.Files to Exclude
: by default, the Etlworks Integrator will create staging files for all dimensions (inner objects). You can specify a comma-separated list of dimensions to exclude.Dimensions to convert to string
: a comma-separated list of nested dimensions to convert to stings (stringify) in the original format (JSON, XML, etc.). To convert all nested dimensions to strings enterall
in this field.Do not create tables for dimensions converted to string
: if this option is enabled the flow will not create staging files for dimensions (nested fields) converted to strings.Add Parent name to the staging file name
: this option is enabled the flow will add the parent file name to the staging table name in the formatparent_staging
.
Rules used when the flow creates staging tables and files
The Etlworks Integrator uses the following rules when it creates and populates staging tables and files:
- The Etlworks Integrator creates a main table/file with the same name as the destination (
TO
). - The Etlworks Integrator creates a separate table/file for each dimension (a nested node in the original document). The name is
destination_dimensionName
. For example,users_phoneNumbers
. IfAdd Parent name
is enabled the flow will add the parent name to the staging table/file name in the format:parentNodeName_dimensionName
. - If
Add Parent name as a column
is enabled the Etlworks Integrator adds the fieldnestedparentn
to each row of each created table and file. It sets the value of this field to the parent node name. - The Etlworks Integrator does not create tables/files for the excluded dimensions.
- The Etlworks Integrator sets the values of the nested fields to
null
unless they are included inDimensions to convert to string
. - The Etlworks Integrator uses the original field names unless they are changed in the
MAPPING
. - The Etlworks Integrator adds the field
nestedrowid
to each row of each created table and file. The value of this field isUUID
. - The Etlworks Integrator adds the field
nestedparentrowid
to each row of the dimension (inner dataset). The value of this field is a parent'snestedrowid
.
How to preserve the original value of the nested field
By default, the Flow converts all nested dimensions into flat tables or files. It is possible to convert nested fields into stringified JSONs or XMLs, hence preserving the original values.
For example, here is a part of the response from the Magento /orders
API endpoint. The nested dimensions are items
, billing_address
, payment
and extension attributes
.
By default, the Flow creates an order
table, converts all nested dimensions to the staging tables with the same names, and sets values of the fields items
, billing_address
, payment
and extension attributes
for each order to null
.
To preserve the original JSON values of the fields items
, billing_address
, payment
and extension attributes
add them to the Dimensions to convert to string
.
It is recommended to enable Do not create tables for dimensions converted to string
to direct the Flow not to create staging tables for these nested fields.
How to handle a variable number of columns in the source
It is quite often when the source contains a variable number of columns in each row. Example:
To create a table that contains a superset of all columns simply enable the option Alter Staging Tables.
Comments
0 comments
Please sign in to leave a comment.