Overview
Etlworks allows you to leverage SQL to query data from non-relational sources such as files (CSV, JSON, XML, Excel, Parquet, Avro and others) and API responses. This feature supports both flat and nested datasets, enabling powerful transformations, filtering, and aggregations using familiar SQL syntax.
How It Works
When querying non-relational datasets:
Virtual Relational Model: The data is transformed into a virtual table-like structure.
- Flat datasets (e.g., CSV, Excel) map directly to columns and rows.
- Nested datasets (e.g., JSON, XML) are hierarchically parsed, and nested nodes are exposed using dot notation (e.g., node1.node2.node3).
SQL Query Engine: A query engine processes the SQL statements, applying filtering, aggregation, and set operations like UNION, INTERSECT, etc.
Dynamic Metadata: Column names and hierarchical relationships are dynamically determined based on the dataset.
Execution Context:
- Data variables: dataSet and currentRecord allow contextual access to the dataset and current record.
- Built-in and user-defined functions can be used for transformations.
Supported Sources
Flat Datasets: CSV, Excel, Avro, Parquet.
Nested Datasets: JSON, XML.
API Responses: JSON, XML, or any supported format.
SQL Capabilities
Column Selection: Select specific fields or nested attributes.
Filtering: Use WHERE clauses for conditions.
Aggregation: Supported functions include COUNT, MAX, MIN, and AVG.
Set Operations: Perform operations like UNION, UNION ALL, MINUS, and INTERSECT.
Hierarchy Navigation:
- Use dot notation to access nested nodes (e.g., address.city).
- Use dataSet.getFieldValue() and dataSet.getFieldValueWithNested() for dynamic access.
SQL Syntax
General Syntax
SELECT column1, column2, nestedNode1.nestedNode2
FROM datasetNode
WHERE condition
GROUP BY column1
ORDER BY column2;
Special Variables
- dataSet: The current dataset in context.
- dataSet.getFieldValue(row, 'column'): Retrieves a column value for a row.
- dataSet.getFieldValueWithNested(row, 'parent.child'): Retrieves nested column values.
- dataSet.getNullableFieldValue(row, 'parent.child'): Retrieves nested values, returning NULL if absent.
- currentRecord: The current row being processed.
- currentRecord.getParentDataSetRecord(): Gets the parent row in a nested dataset.
Order of Columns
The built-in SQL engine follows specific rules for determining the order of columns in the result set. These rules ensure consistent behavior but may not always align with the order specified in the SELECT statement. Here’s how it works:
Rules for Column Ordering:
Columns Existing in the Source:
If a column exists in the source and is included in the SELECT statement without an alias, the engine retains the column’s original position as it appears in the source dataset.
New Columns or Aliases:
If the column is derived, renamed using an alias, or does not exist in the source, the engine places the column in the position defined by the SELECT statement.
Order Examples
Source Dataset:
OrderNum, OrderDate, Amount
Example 1:
SELECT Amount, OrderDate, OrderNum
Result:
The output will maintain the original column order:
OrderNum, OrderDate, Amount
Example 2:
SELECT Amount AS amt, OrderDate AS order_date, OrderNum AS order_num
Result:
The output will follow the column order in the SELECT statement:
amt, order_date, order_num
Explanation:
Since all columns are aliased (renamed), the engine respects the order specified in the SELECT statement.
Examples
Example 1: Select from Flat Dataset
Source (CSV):
firstName,lastName,age
John,Doe,30
Jane,Smith,25
Query:
SELECT firstName, age FROM dataSet WHERE age > 26;
Output:
firstName | age |
John | 30 |
Example 2: Select from Nested Dataset
Source (JSON):
{
"orders": [
{
"orderId": 1,
"price": 250,
"items": [
{ "itemId": "A1", "itemName": "Laptop", "quantity": 1 },
{ "itemId": "A2", "itemName": "Mouse", "quantity": 2 }
]
},
{
"orderId": 2,
"price": 100,
"items": [
{ "itemId": "B1", "itemName": "Keyboard", "quantity": 1 },
{ "itemId": "B2", "itemName": "Monitor", "quantity": 2 }
]
}
]
}
Query:
SELECT orderId, price, itemId, itemName, quantity
FROM orders.items;
Explanation:
orders Level:
- Contains fields such as orderId and price.
items Level:
- Nested within orders and contains fields such as itemId, itemName, and quantity.
orders.items in FROM Clause:
- Specifies the query operates at the items level.
- Columns from orders (e.g., orderId, price) are implicitly available to the query as they are parent-level fields.
Output:
orderId | price | itemId | itemName | quantity |
1 | 250 | A1 | Laptop | 1 |
1 | 250 | A2 | Mouse | 2 |
2 | 100 | B1 | Keyboard | 1 |
3 | 100 | B2 | Monitor | 2 |
Example 3: Access Child Data With a Single Record
When a child node (e.g., address) contains only one record, it can be accessed directly using dot notation (e.g., parent.child). This works because there is no ambiguity regarding which child record to retrieve—there is only one.
If the child node contained multiple records (e.g., an array), the query would need to address this ambiguity, usually by flattening or aggregating the data.
Source (JSON):
{
"data": [
{
"first": "Joe",
"last": "Doe",
"address": {
"street": "123 Elm St",
"city": "Springfield",
"state": "IL",
"zip": "62704"
}
},
{
"first": "Jane",
"last": "Smith",
"address": {
"street": "456 Oak St",
"city": "Lincoln",
"state": "IL",
"zip": "62656"
}
}
]
}
Query:
SELECT first, last, address.street, address.city, address.state, address.zip
FROM data;
Explanation:
Child Node with a Single Record (address):
- The address node contains a single record, not an array.
- Fields like street, city, state, and zip are directly accessible with dot notation.
- Example: address.street, address.city.
Result:
- The query extracts both parent (first, last) and child (address) fields into a flat structure.
Output:
first |
last |
street |
city |
state |
zip |
Joe |
Doe |
123 Elm St |
Springfield |
IL |
62704 |
Jane |
Smith |
456 Oak St |
Lincoln |
IL |
62656 |
Example 4: Aggregation
Source (JSON):
{
"orders": [
{ "orderId": 1, "amount": 100 },
{ "orderId": 2, "amount": 150 },
{ "orderId": 3, "amount": 200 }
]
}
Query:
SELECT COUNT(*) AS totalOrders, AVG(amount) AS avgAmount FROM orders;
Output:
totalOrders | avgAmount |
3 | 150 |
Example 5: Set Operations
Source (JSON):
{
"groupA": [ { "name": "Alice" }, { "name": "Bob" } ],
"groupB": [ { "name": "Bob" }, { "name": "Charlie" } ]
}
Query:
SELECT name FROM groupA
UNION
SELECT name FROM groupB;
Output:
name |
Alice |
Bob |
Charlie |
Special and Edge Cases
Dynamic Column References
Dynamic column references enable users to access deeply nested data levels dynamically by specifying the full path of the column using dot notation. This is particularly useful when dealing with hierarchically structured data where each level has a single record.
The dataSet.getFieldValueWithNested(currentRecord, 'node1.node2.columnName') function allows you to reference columns nested within multiple levels. The currentRecord parameter can either be the literal variable currentRecord or a specific row number (0-based index) within the current driving dataset.
Source (JSON):
{
"company": {
"departments": {
"HR": {
"head": {
"name": "Alice",
"email": "alice@company.com"
},
"location": "Building A"
},
"Engineering": {
"head": {
"name": "Bob",
"email": "bob@company.com"
},
"location": "Building B"
}
}
}
}
Example:
SELECT
dataSet.getFieldValueWithNested(currentRecord, 'company.departments.HR.head.name').getValue() AS hrHeadName,
dataSet.getFieldValueWithNested(currentRecord, 'company.departments.HR.location').getValue() AS hrLocation,
dataSet.getFieldValueWithNested(currentRecord, 'company.departments.Engineering.head.name').getValue() AS engineeringHeadName,
dataSet.getFieldValueWithNested(currentRecord, 'company.departments.Engineering.location').getValue() AS engineeringLocation
FROM company;
Explanation:
Dynamic Access:
- The dataSet.getFieldValueWithNested() function dynamically navigates through nested nodes using dot notation.
- Each node (e.g., company.departments.HR) corresponds to a level in the hierarchy, with the column name (e.g., head.name) specified at the end.
Driving Dataset:
- The currentRecord refers to the current row in the driving dataset (root node company in this case).
- The function extracts values from the nested levels of the JSON.
Output Columns:
- The query generates columns like hrHeadName, hrLocation, engineeringHeadName, and engineeringLocation based on the specified paths.
hrHeadName |
hrLocation |
engineeringHeadName |
engineeringLocation |
Alice |
Building A |
Bob |
Building B |
Selecting From Nested Node and Accessing Data From a Sibling Node Using getParent
In this example, we query data from a nested node (intervals) and access additional information from its sibling node (asset) using getParent. The parent dataset (data) provides contextual information such as the asset’s name and driver ID.
Source (JSON):
[
{
"tripStartTime": "2024-12-17T00:08:13.029Z",
"createdAtTime": "2024-12-17T00:08:51.352Z",
"updatedAtTime": "2024-12-17T00:37:16.026Z",
"asset": {
"id": "281474988523670",
"name": "422660",
"type": "Vehicle"
},
"intervals": [
{
"postedSpeedLimitKilometersPerHour": 40,
"maxSpeedKilometersPerHour": 50.004,
"severityLevel": "moderate",
"startTime": "2024-12-17T00:09:41.032Z",
"endTime": "2024-12-17T00:09:47.129Z",
"address": {
"street": "Anonymized St",
"city": "Anytown",
"state": "XY",
"postalCode": "00000",
"country": "US"
}
}
],
"driverId": "51550209"
},
{
"tripStartTime": "2024-12-17T00:39:09.025Z",
"createdAtTime": "2024-12-17T00:39:28.039Z",
"updatedAtTime": "2024-12-17T03:49:04.002Z",
"asset": {
"id": "281474986252047",
"name": "419439",
"type": "Vehicle"
},
"intervals": [
{
"postedSpeedLimitKilometersPerHour": 105,
"maxSpeedKilometersPerHour": 113.009,
"severityLevel": "moderate",
"startTime": "2024-12-17T00:48:44.028Z",
"endTime": "2024-12-17T00:48:54.03Z",
"address": {
"street": "Anonymized Ave",
"city": "Somewhere",
"state": "YZ",
"postalCode": "11111",
"country": "US"
}
},
}
],
"driverId": "50575550"
}
]
Query:
SELECT
startTime,
endTime,
severityLevel,
maxSpeedKilometersPerHour,
postedSpeedLimitKilometersPerHour,
dataSet.getParentDataSet().getFieldValueWithNested(currentRecord.getParentDataSetRecord(), 'asset.name').getValue() AS tractor_id
FROM intervals;
Explanation:
Parent Data Access: getParentDataSet() is used to retrieve the asset.name and driverId fields from the parent node.
Dynamic Hierarchical Navigation: The intervals node is queried, and fields from its sibling (asset) are dynamically included in the result.
Output:
startTime |
endTime |
severityLevel |
maxSpeedKilometersPerHour |
postedSpeedLimitKilometersPerHour |
tractor_id |
2024-12-17T00:09:41.032Z |
2024-12-17T00:09:47.129Z |
moderate |
50.004 |
40 |
422660 |
2024-12-17T00:48:44.028Z |
2024-12-17T00:48:54.03Z |
moderate |
113.009 |
105 |
419439 |
Missing or Null Nodes
When querying JSON data, there are scenarios where specific nodes or attributes might be missing or null. The SQL engine handles these cases gracefully by returning null for the missing or null data, ensuring no errors are thrown during execution.
Source (JSON):
[
{
"id": 1,
"name": "John Doe",
"address": {
"street": "123 Elm St",
"city": "Springfield",
"state": "IL"
}
},
{
"id": 2,
"name": "Jane Smith",
"address": null
},
{
"id": 3,
name": "Alice Johnson"
}
]
Query:
SELECT
id,
name,
dataSet.getNullableFieldValueWithNested(currentRecord, 'address.street').getValue() AS street,
dataSet.getNullableFieldValueWithNested(currentRecord, 'address.city').getValue() AS city,
dataSet.getNullableFieldValueWithNested(currentRecord, 'address.state').getValue() AS state
Explanation:
Handling Nulls:
- The dataSet.getNullableFieldValueWithNested function is used to retrieve values from potentially missing or null nodes.
- If the node or attribute is null or missing, the function returns null instead of throwing an error.
Dynamic Navigation: The address object is referenced dynamically using nested paths like address.street, address.city, and address.state.
Output:
id |
name |
street |
city |
state |
1 |
John Doe |
123 Elm St |
Springfield |
IL |
2 |
Jane Smith |
null |
null |
null |
3 |
Alice Johnson |
null | null | null |
Detailed Explanation of Output:
First Record: The address node exists with all attributes populated, so street, city, and state are returned as-is.
Second Record: The address node is explicitly set to null. As a result, all columns (street, city, state) return null.
Third Record: The address node is completely missing. The query engine handles this seamlessly, returning null for all missing fields.
Querying Excel Data
When querying data from Excel files, the main difference compared to querying from text files (e.g., CSV) is the need to specify the worksheet name in the FROM clause. Excel data can be referenced using the syntax [worksheet name], where the worksheet name must be enclosed in square brackets.
Source Data (Excel File: Orders.xlsx, Worksheet: Sheet1)
OrderNum |
OrderDate |
Amount |
1001 |
2024-12-01 |
150.00 |
1002 |
2024-12-02 |
200.00 |
1003 |
2024-12-03 |
300.00 |
Query:
SELECT OrderNum, OrderDate, Amount
FROM [Sheet1]
WHERE Amount > 150;
Output:
OrderNum |
OrderDate |
Amount |
1002 |
2024-12-02 |
200.00 |
1003 |
2024-12-03 |
300.00 |
Best Practices
Optimize for Large Datasets:
- Use filters (WHERE) to minimize processing.
Use Parent-Child Relationships:
- Access parent or ancestor records using hierarchical navigation methods.
Combine SQL with JavaScript:
- Extend SQL capabilities by using JavaScript functions or available libraries.
Test for Edge Cases:
- Validate queries for missing data, unexpected formats, and deeply nested structures.
Testing SQL queries in Explorer
Step 1. Select Explorer
Step 2. Expand nested data object
Step 3. Click Develop SQL
Step 4. Enter SQL in the editor
Step 5. Click Execute SQL.
Comments
0 comments
Please sign in to leave a comment.