When to use this transformation
Use the SQL transformation to extract data from relational databases, or to transform data by running SQL on unstructured or semi-structured data sources, such as a file, an API response, or a social network data feed.
Create a transformation
When creating a source-destination transformation, enter a query into the Source query field on the Mappings tab.
Write SQL queries when working with relational databases
Use the native SQL syntax of the source relational database. In addition to SELECT statements, you can execute stored procedures and functions, as long as they return a cursor.
Write SQL queries when working with files and API endpoints
You can use SQL to extract data from a nested source dataset (such as JSON or XML) and convert it to a flat or nested array. Applying an SQL transformation to a nested dataset creates an array of columns and rows, which can be mapped, for example, to a table in a relational database.
Writing SQL queries that will be executed on a dataset of any type is not much different from writing queries that will be executed against a relational database:
SQL syntax
- Each query starts with the `select` keyword, followed by a comma-separated list of fields to include in the query.
- It is possible to select all fields by using `*`, as in `select *`.
- Fields with spaces must be surrounded by double quotes: `"first name"`.
- It is possible to use aliases for fields, such as `"first name" first_name`.
- It is possible to use mathematical expressions to calculate field values, such as `salary + bonus compensation`.
- Typically there is no `FROM` clause (`FROM` is expected only when selecting data from a nested data object). The following query will work just fine: `select first, last`.
- String literals must be enclosed in single quotes: `'john'`.
- Use the `where` condition for filtering out records in a dataset: `select * where salary > 50000`.
- Use `field != null` to compare a field value with null.
- It is possible to use JavaScript functions in expressions for calculating field values and in the `where` clause: `select * where first_name.toLowerCase().trim() = 'john'`.
- It is possible to use aggregation functions, such as `SUM`, `AVG`, `MIN`, `MAX`, and `COUNT`, to calculate field values. Using aggregation functions requires the `group by` expression.
- Use `order by fields` for ordering datasets. The `asc` and `desc` modifiers are allowed: `select * where salary > 50000 order by salary desc`.
- It is important to understand the order of operations:
  - First, the `where` clause conditions are applied.
  - Then `group by` is applied.
  - Then the fields are calculated.
  - Finally, `order by` is applied.
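Putting several of these rules together, a query against a flat dataset might look like the sketch below. The `first_name`, `salary`, and `bonus` fields are hypothetical, and note that no `FROM` clause is needed for a flat source:

```sql
select first_name.toUpperCase() name,    -- JavaScript function in an expression
       salary + bonus compensation      -- calculated field with an alias
where salary > 50000 and bonus != null  -- filtering is applied first
order by compensation desc              -- ordering is applied last
```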
Order of columns
The built-in SQL engine does not always return columns in the same order as in the SELECT statement. The following rules are applied:
- If the column already exists in the source, the engine will keep the position of the column in the source.
- If it is a new column, or a new alias for an existing column, the engine will place the column in the position defined by the SELECT statement.
For example, for the source `OrderNum, OrderDate, Amount`, the following SQL statement:
select Amount, OrderDate, OrderNum
will still return the columns in the original order `OrderNum, OrderDate, Amount`, while the SQL statement:
select Amount amt, OrderDate order_date, OrderNum order_num
will return the columns in the following order: `amt, order_date, order_num`.
In short, if you care about the order of columns, always use aliases with names that differ from the column names.
Aggregation functions
Suppose that the source data object is an Excel file containing order items.
The goal is to read a source Excel file and load it into two database tables:
- Order: `OrderNum`, `OrderDate`, `OrderAmount`, where `OrderAmount` is calculated as `SUM(amount)` for all items in the order.
- Item: `OrderNum`, `OrderDate`, `Supplier`, `SKU`, `Amount`.
To split a single source dataset into two, we will be using the technique described here.
To create the first transformation, which populates the Order table, simply write a Source SQL query that includes an aggregation function, as in the example below:
select "Order Number" OrderNum, "Order Date" as OrderDate, sum(amount) OrderAmount
group by "Order Number"
Here we are using the `group by` expression together with `SUM(amount)`.
The following aggregation functions are supported:
- `SUM(field)`: sum of all aggregated values of the field.
- `MIN(field)`: minimum value of the field.
- `MAX(field)`: maximum value of the field.
- `AVG(field)`: average value of the field.
- `COUNT(*)`: number of records.
The `group by` expression can include one or more comma-separated field names to group the dataset. Expressions are not supported as parameters of aggregation functions.
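For instance, grouping the same order items by more than one field is a matter of listing the fields in the `group by` expression. The sketch below assumes the `Supplier` column from the Item example above:

```sql
select "Order Number" OrderNum, Supplier, count(*) ItemCount, sum(amount) SupplierAmount
group by "Order Number", Supplier
```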
Set operations
The following set operations are supported when writing SQL that extracts data from files, API endpoints, and social websites:
- `UNION`: combines the result sets of two or more `SELECT` statements. Unlike the `UNION` operator used when running a query against a relational database, it does not remove duplicate rows between the various `SELECT` statements, so it works similarly to `UNION ALL`.
- `INTERSECT`: returns only the rows selected by all queries or datasets. If a record exists in one query and not in the other, it will be omitted from the `INTERSECT` results.
- `MINUS`: returns all rows in the first `SELECT` statement that are not returned by the second `SELECT` statement.
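As a sketch, assuming a nested source with two arrays of field definitions (like the `result.searchableFields` and `result.fields` arrays in the JSON example that follows), `MINUS` can be used to find names that appear in the first array but not the second:

```sql
select name from result.searchableFields
minus
select name from result.fields
```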
Write queries for nested data objects
Let's assume that the JSON document below is returned by a web service.
{
"requestId":"185d6#14b51985ff0",
"success":true,
"result":[
{
"id":"433455454",
"name":"all fields",
"searchableFields":[
{
"name":"field1",
"displayName":"Just field 1",
"dataType":"string",
"length":36,
"updateable":false
},
{
"name":"field2",
"displayName":"Just field 2",
"dataType":"string",
"length":50,
"updateable":false
}
],
"fields":[
{
"name":"leadId",
"displayName":"Lead Id",
"dataType":"integer",
"updateable":false
},
{
"name":"role",
"displayName":"Role",
"dataType":"string",
"length":50,
"updateable":false
},
{
"name":"isPrimary",
"displayName":"Is Primary",
"dataType":"boolean",
"updateable":true
},
{
"name":"externalCreatedDate",
"displayName":"External Created Date",
"dataType":"datetime",
"updateable":true
}
]
}
]
}
The goal is to extract all `searchableFields` and `fields` from the nested JSON and create a flat CSV file.
This can be accomplished by using the following SQL query, which is applied to the nested source JSON:
select name, displayName, 'searchable' as fieldType from result.searchableFields union select name, displayName, 'regular' as fieldType from result.fields
The idea is to provide a path to the nested dimension in the `FROM` clause. The elements of the path are separated by a dot (`.`). Using the example above, the path to the nested attributes under `searchableFields` is `result.searchableFields`, and the path to the nested attributes under `fields` is `result.fields`.
If you want to include some of the fields from the parent object, as well as fields from the nested object, qualify the parent fields with the name of the owner (parent) object:
select result.name as apiName, searchableFields.name, searchableFields.displayName,
'searchable' as fieldType
from result.searchableFields
union
select result.name as apiName, fields.name, fields.displayName,
'regular' as fieldType
from result.fields
Write queries when a column may not exist in the data object
Let's assume that there is a JSON data object like the one below:
{
"type": "type",
"data": {
"unitId": "1234",
"type": "test1",
"odometer": 12,
"timestamp": "01/01/2023"
}
}
You only need the data from the "data" node, so the query will look like this:
select
unitId,
odometer,
type,
timestamp
from data
It is possible that in some data objects, the "odometer" attribute is missing:
{
"type": "type",
"data": {
"unitId": "456",
"type": "test2",
"timestamp": "02/02/2023"
}
}
If that is the case, the query above will fail. The workaround is to replace `odometer` with `dataSet.getFieldValue(dataSet.getRecord(0), 'odometer') as odometer`:
select
unitId,
dataSet.getFieldValue(dataSet.getRecord(0), 'odometer') as odometer,
type,
timestamp
from data
Write queries when a nested node may not exist in the data object
In a typical nested JSON or XML document, any node can potentially be empty, or even absent for some rows. Example:
<root>
<row>
<nested_node1>
<field1/>
<field2>abc</field2>
</nested_node1>
<nested_node2>
<field1/>
<field2>xyz</field2>
</nested_node2>
</row>
<row>
<nested_node1>
<field1/>
<field2>abc</field2>
</nested_node1>
</row>
<row>
<nested_node2>
<field1/>
<field2>abc</field2>
</nested_node2>
</row>
</root>
If you try to retrieve `field1` and `field2` from `<nested_node1>` using the SQL below:
select nested_node1.field1, nested_node1.field2 from root.row
or
select nested_node1.getFieldValue(0, 'field1'),
nested_node1.getFieldValue(0, 'field2')
from root.row
you will get a null pointer exception, because for some of the records `<nested_node1>` is null (does not exist).
To solve the problem, use `getNullableFieldValue`:
select dataSet.getNullableFieldValue(nested_node1).getFieldValue(0, 'field1'),
dataSet.getNullableFieldValue(nested_node1).getFieldValue(0, 'field2')
from root.row
If there are multiple levels of nesting, starting from the first nested field, use:
dataSet.getNullableFieldValue(firstNestedField).
getNullableFieldValue(0, "field name in the middle").
getFieldValue(0, "last field name")
The first call to `dataSet.getNullableFieldValue` with a single argument will return an empty dataset if `firstNestedField` is null. The second call to `getNullableFieldValue`, with a row number (`0`) and the name of the middle field, will do the same if the middle field is null. The final call to `getFieldValue` will return null if the last field is null, which is exactly what we want.
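As a sketch, assume a hypothetical source where each `row` contains an optional `nested_node1` that may itself contain an optional `details` node (`details` and the field names here are assumptions for illustration). A complete query using the chain above might look like this:

```sql
select dataSet.getNullableFieldValue(nested_node1).
       getNullableFieldValue(0, 'details').
       getFieldValue(0, 'field2') as field2
from root.row
```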