When to use this transformation
Use the SQL transformation to extract data from relational databases, or to transform data by running SQL on unstructured or semi-structured data sources, such as a file, an API response, or a data feed from a social network.
Creating a transformation
When creating a source-destination transformation, enter a query into the Source query field on the Mapping tab.
Writing SQL queries when working with relational databases
Use native SQL syntax of the source relational database. In addition to SELECT SQL statements, you can execute stored procedures and functions as long as they return a cursor.
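As a sketch, assuming an Oracle source with a hypothetical `orders` table, the Source query can use any vendor-specific syntax supported by that database, since it is passed through as-is:

```sql
-- Hypothetical Oracle source: vendor-specific functions such as
-- trunc() and sysdate are executed by the database itself.
select order_id, trunc(order_date) as order_day, amount
from orders
where order_date > sysdate - 30
```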
Writing SQL queries when working with files and API endpoints
Writing SQL queries that will be executed on a dataset of any type is not much different from writing queries against a relational database:

- Each query starts with the `select` keyword, followed by a comma-separated list of fields to include in the query.
- It is possible to select all fields by using `*`, as in `select *`.
- Fields with spaces must be surrounded by double quotes: `"first name"`.
- It is possible to use aliases for fields, such as `"first name" first_name`.
- It is possible to use mathematical expressions to calculate field values, such as `salary + bonus compensation`.
- Typically there is no `from` clause (expect `from` to be used when selecting data from a nested data object). The following query will work just fine: `select first, last`.
- String literals must be enclosed in single quotes: `'john'`.
- Use the `where` condition for filtering out records in a dataset: `select * where salary > 50000`.
- Use `field != null` to compare a field value with null.
- It is possible to use JavaScript functions in expressions, both for calculating field values and in the `where` clause: `select * where first_name.toLowerCase().trim() = 'john'`.
- It is possible to use aggregation functions, such as `SUM`, `AVG`, `MIN`, `MAX`, and `COUNT`, to calculate field values. Using aggregation functions requires the `group by` expression.
- Use `order by` for ordering datasets. The `asc` and `desc` modifiers are allowed: `select * where salary > 50000 order by salary desc`.
- It is important to understand the order of operations:
  1. First, the `where` clause conditions are applied.
  2. Then `group by` is applied.
  3. Then field values are calculated.
  4. Finally, `order by` is applied.
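Putting these rules together, a query against a flat dataset with hypothetical `first_name`, `salary`, and `bonus` fields could look like the sketch below (note there is no `from` clause, and a JavaScript string function is used in the field list):

```sql
select first_name.trim() name, salary + bonus compensation
where salary != null and salary > 50000
order by salary desc
```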
Order of columns
The built-in SQL engine does not always return columns in the same order as in the SELECT statement. The following rules apply:
- If the column already exists in the source, the engine keeps the column's position in the source.
- If it is a new column, or a new alias for an existing column, the engine places the column in the position defined by the SELECT statement.
For example, for the source `OrderNum, OrderDate, Amount`, the SQL statement `select Amount, OrderDate, OrderNum` will still return columns in the original order `OrderNum, OrderDate, Amount`, while the statement `select Amount amt, OrderDate order_date, OrderNum order_num` will return columns in the order `amt, order_date, order_num`.

In short, if you care about the order of columns, always use aliases with names different from the original column names.
Using aggregation functions
Suppose that the source data object is an Excel file containing order items.
The goal is to read the source Excel file and load it into two database tables:

- Order - `OrderNum`, `OrderDate`, `OrderAmount`, where `OrderAmount` is calculated as `SUM(Amount)` for all items in the order.
- Item - `OrderNum`, `OrderDate`, `Supplier`, `SKU`, `Amount`.
To split a single source dataset into two, we will be using the technique described here.
To create the first transformation, which populates the Order table, simply write a Source SQL query which includes an aggregation function, as in the example below:
```sql
select "Order Number" OrderNum, "Order Date" as OrderDate, sum(amount) OrderAmount
group by "Order Number"
```
Here we use the `group by` expression together with `SUM(amount)`.
The following aggregation functions are supported:
- `SUM(field)` - sum of all aggregated values of the field.
- `MIN(field)` - minimum value of the field.
- `MAX(field)` - maximum value of the field.
- `AVG(field)` - average value of the field.
- `COUNT(*)` - number of records.
The group by expression can include one or more comma-separated field name(s) to group the dataset.
Expressions are not supported as parameters for aggregation functions.
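As another sketch, several aggregation functions can be combined over the same hypothetical order items dataset (field names here are assumptions). Note that each parameter is a plain field, since expressions are not allowed:

```sql
select "Order Number" OrderNum, count(*) ItemCount,
min(amount) MinAmount, max(amount) MaxAmount, avg(amount) AvgAmount
group by "Order Number"
```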
Using set operations
The following set operations are supported when writing SQL that extracts data from files, API endpoints, and social networks:
- UNION - the UNION operator combines the result sets of two or more SELECT statements. Unlike the UNION operator used when running a query against a relational database, it does not remove duplicate rows between the SELECT statements, so it works like UNION ALL.
- INTERSECT - the INTERSECT operator returns only the rows selected by all queries or datasets. If a record exists in one query but not in the other, it is omitted from the INTERSECT results.
- MINUS - the MINUS operator returns all rows from the first SELECT statement that are not returned by the second SELECT statement.
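For example, assuming a nested source with hypothetical `activeUsers` and `bannedUsers` arrays under a `result` node, MINUS could be used to keep only the rows that appear in the first dataset:

```sql
select email from result.activeUsers
minus
select email from result.bannedUsers
```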
Writing queries for nested data objects
This is an example where the source is a nested data object returned by a web service.
```json
{
  "requestId":"185d6#14b51985ff0",
  "success":true,
  "result":[
    {
      "id":"433455454",
      "name":"all fields",
      "searchableFields":[
        {
          "name":"field1",
          "displayName":"Just field 1",
          "dataType":"string",
          "length":36,
          "updateable":false
        },
        {
          "name":"field2",
          "displayName":"Just field 2",
          "dataType":"string",
          "length":50,
          "updateable":false
        }
      ],
      "fields":[
        {
          "name":"leadId",
          "displayName":"Lead Id",
          "dataType":"integer",
          "updateable":false
        },
        {
          "name":"role",
          "displayName":"Role",
          "dataType":"string",
          "length":50,
          "updateable":false
        },
        {
          "name":"isPrimary",
          "displayName":"Is Primary",
          "dataType":"boolean",
          "updateable":true
        },
        {
          "name":"externalCreatedDate",
          "displayName":"External Created Date",
          "dataType":"datetime",
          "updateable":true
        }
      ]
    }
  ]
}
```
The goal is to extract all `searchableFields` and `fields` from the nested data object and create a flat CSV file.
This can easily be accomplished by using the following SQL query, which is applied to the nested source data object:
```sql
select name, displayName, 'searchable' as fieldType from result.searchableFields
union
select name, displayName, 'regular' as fieldType from result.fields
```
The additional `from` clause is of primary importance. When writing a query to be executed on a nested data object, you can specify the level from which the data will be extracted. Use `.` as a delimiter to drill down into the nested hierarchy of objects.

If you want to include some of the fields from the parent object, as well as fields from the nested object, use the `owner.` qualifier:
```sql
select result.name as apiName, searchableFields.name, searchableFields.displayName,
'searchable' as fieldType
from result.searchableFields
union
select result.name as apiName, fields.name, fields.displayName,
'regular' as fieldType
from result.fields
```
Writing queries when nested data objects can be null
In a typical nested JSON or XML document, any node can be empty, or even missing for some rows. Example:
```xml
<root>
  <row>
    <nested_node1>
      <field1/>
      <field2>abc</field2>
    </nested_node1>
    <nested_node2>
      <field1/>
      <field2>xyz</field2>
    </nested_node2>
  </row>
  <row>
    <nested_node1>
      <field1/>
      <field2>abc</field2>
    </nested_node1>
  </row>
  <row>
    <nested_node2>
      <field1/>
      <field2>abc</field2>
    </nested_node2>
  </row>
</root>
```
If you try to retrieve `field1` and `field2` for `<nested_node1>` using the SQL below:

```sql
select nested_node1.field1, nested_node1.field2 from root.row
```
or
```sql
select nested_node1.getFieldValue(0, 'field1'),
nested_node1.getFieldValue(0, 'field2')
from root.row
```
you will get a null pointer exception, because for some of the records `<nested_node1>` is null (does not exist).

To solve the problem, use `getNullableFieldValue(...)`:
```sql
select getNullableFieldValue(nested_node1).getFieldValue(0, 'field1'),
getNullableFieldValue(nested_node1).getFieldValue(0, 'field2')
from root.row
```
If there are multiple levels of nesting, then, starting from the first nested field, use:
```sql
dataSet.getNullableFieldValue(firstNestedField).
getNullableFieldValue(0, "field name in the middle").
getFieldValue(0, "last field name")
```
The first call to `getNullableFieldValue`, with a single argument, returns an empty dataset if `firstNestedField` is null. The second call, with a row number (0) and the name of the field in the middle, does the same if that field is null. The final call to `getFieldValue(...)` returns null if the last field is null, which is exactly what we want.