Overview

Etlworks allows you to leverage SQL to query data from non-relational sources such as files (CSV, JSON, XML, Excel, Parquet, Avro and others) and API responses. This feature supports both flat and nested datasets, enabling transformations, filtering, and aggregations using familiar SQL syntax.

How It Works

When querying non-relational datasets:

Virtual Relational Model: The data is transformed into a virtual table-like structure.

Flat datasets (e.g., CSV, Excel) map directly to columns and rows.
Nested datasets (e.g., JSON, XML) are hierarchically parsed, and nested nodes are exposed using dot notation (e.g., node1.node2.node3).

SQL Query Engine: A query engine processes the SQL statements, applying filtering, aggregation, and set operations like UNION, INTERSECT, etc.

Dynamic Metadata: Column names and hierarchical relationships are dynamically determined based on the dataset.

Execution Context:

Data variables: dataSet and currentRecord allow contextual access to the dataset and current record.
Built-in and user-defined functions can be used for transformations.

Supported Sources

Flat Datasets: CSV, Excel, Avro, Parquet.

Nested Datasets: JSON, XML.

API Responses: JSON, XML, or any supported format.

SQL Capabilities

Column Selection: Select specific fields or nested attributes.

Filtering: Use WHERE clauses for conditions.

Aggregation: Supported functions include COUNT, MAX, MIN, and AVG.

Set Operations: Perform operations like UNION, UNION ALL, MINUS, and INTERSECT.

Hierarchy Navigation:

Use dot notation to access nested nodes (e.g., address.city).
Use dataSet.getFieldValue and dataSet.getFieldValueWithNested for dynamic access.

SQL Syntax

General Syntax

SELECT column1, column2, nestedNode1.nestedNode2
FROM datasetNode
WHERE condition
GROUP BY column1
ORDER BY column2;

Special Variables

dataSet: The current dataset in context.
dataSet.getFieldValue(row, 'column'): Retrieves a column value for a row.
dataSet.getFieldValueWithNested(row, 'parent.child'): Retrieves nested column values.
dataSet.getNullableFieldValue(row, 'parent.child'): Retrieves nested values, returning NULL if absent.
currentRecord: The current row being processed.
currentRecord.getParentDataSetRecord: Gets the parent row in a nested dataset.

Order of Columns

The built-in SQL engine follows specific rules for determining the order of columns in the result set. These rules ensure consistent behavior but may not always align with the order specified in the SELECT statement. Here’s how it works:

Rules for Column Ordering:

Columns Existing in the Source:

If a column exists in the source and is included in the SELECT statement without an alias, the engine retains the column’s original position as it appears in the source dataset.

New Columns or Aliases:

If the column is derived, renamed using an alias, or does not exist in the source, the engine places the column in the position defined by the SELECT statement.

Order Examples

Source Dataset:

OrderNum, OrderDate, Amount

Example 1:

SELECT Amount, OrderDate, OrderNum

Result:

The output will maintain the original column order:

OrderNum, OrderDate, Amount

Example 2:

SELECT Amount AS amt, OrderDate AS order_date, OrderNum AS order_num

Result:

The output will follow the column order in the SELECT statement:

amt, order_date, order_num

Explanation:

Since all columns are aliased (renamed), the engine respects the order specified in the SELECT statement.

Examples

Example 1: Select from Flat Dataset

Source (CSV):

firstName,lastName,age
John,Doe,30
Jane,Smith,25

Query:

SELECT firstName, age FROM dataSet WHERE age > 26;

Output:

firstName	age
John	30

Example 2: Select from Nested Dataset

Source (JSON):

{
"orders": [
   {
     "orderId": 1,
     "price": 250,
     "items": [
      { "itemId": "A1", "itemName": "Laptop", "quantity": 1 },
      { "itemId": "A2", "itemName": "Mouse", "quantity": 2 }
     ]
   },
   {
     "orderId": 2,
     "price": 100,
     "items": [
       { "itemId": "B1", "itemName": "Keyboard", "quantity": 1 },
       { "itemId": "B2", "itemName": "Monitor", "quantity": 2 }
     ]
   }
  ]
}

Query:

SELECT orderId, price, itemId, itemName, quantity
FROM orders.items;

Explanation:

orders Level:

Contains fields such as orderId and price.

items Level:

Nested within orders and contains fields such as itemId, itemName, and quantity.

orders.items in FROM Clause:

Specifies the query operates at the items level.
Columns from orders (e.g., orderId, price) are implicitly available to the query as they are parent-level fields.

Output:

orderId	price	itemId	itemName	quantity
1	250	A1	Laptop	1
1	250	A2	Mouse	2
2	100	B1	Keyboard	1
3	100	B2	Monitor	2

Example 3: Access Child Data With a Single Record

When a child node (e.g., address) contains only one record, it can be accessed directly using dot notation (e.g., parent.child). This works because there is no ambiguity regarding which child record to retrieve—there is only one.

If the child node contained multiple records (e.g., an array), the query would need to address this ambiguity, usually by flattening or aggregating the data.

Source (JSON):

{
"data": [
   {
     "first": "Joe",
     "last": "Doe",
     "address": {
       "street": "123 Elm St",
       "city": "Springfield",
       "state": "IL",
       "zip": "62704"
     }
   },
   {
     "first": "Jane",
     "last": "Smith",
     "address": {
        "street": "456 Oak St",
        "city": "Lincoln",
        "state": "IL",
        "zip": "62656"
     }
   }
 ]
}

Query:

SELECT first, last, address.street, address.city, address.state, address.zip
FROM data;

Explanation:

Child Node with a Single Record (address):

The address node contains a single record, not an array.
Fields like street, city, state, and zip are directly accessible with dot notation.
Example: address.street, address.city.

Result:

The query extracts both parent (first, last) and child (address) fields into a flat structure.

Output:

first	last	street	city	state	zip
Joe	Doe	123 Elm St	Springfield	IL	62704
Jane	Smith	456 Oak St	Lincoln	IL	62656

Example 4: Aggregation

Source (JSON):

{
  "orders": [
     { "orderId": 1, "amount": 100 },
     { "orderId": 2, "amount": 150 },
     { "orderId": 3, "amount": 200 }
  ]
}

Query:

SELECT COUNT(*) AS totalOrders, AVG(amount) AS avgAmount FROM orders;

Output:

totalOrders	avgAmount
3	150

Example 5: Set Operations

Source (JSON):

{
   "groupA": [ { "name": "Alice" }, { "name": "Bob" } ],
   "groupB": [ { "name": "Bob" }, { "name": "Charlie" } ]
}

Query:

SELECT name FROM groupA
UNION
SELECT name FROM groupB;

Output:

name
Alice
Bob
Charlie

Special and Edge Cases

Dynamic Column References

Dynamic column references enable users to access deeply nested data levels dynamically by specifying the full path of the column using dot notation. This is particularly useful when dealing with hierarchically structured data where each level has a single record.

The dataSet.getFieldValueWithNested(currentRecord, 'node1.node2.columnName') function allows you to reference columns nested within multiple levels. The currentRecord parameter can either be the literal variable currentRecord or a specific row number (0-based index) within the current driving dataset.

Source (JSON):

{
   "company": {
     "departments": {
       "HR": {
         "head": {
            "name": "Alice",
            "email": "alice@company.com"
         },
       "location": "Building A"
     },
       "Engineering": {
          "head": {
             "name": "Bob",
             "email": "bob@company.com"
          },
       "location": "Building B"
     }
   }
  }
}

Example:

SELECT 
    dataSet.getFieldValueWithNested(currentRecord, 'company.departments.HR.head.name').getValue() AS hrHeadName,
    dataSet.getFieldValueWithNested(currentRecord, 'company.departments.HR.location').getValue() AS hrLocation,
    dataSet.getFieldValueWithNested(currentRecord, 'company.departments.Engineering.head.name').getValue() AS engineeringHeadName,
    dataSet.getFieldValueWithNested(currentRecord, 'company.departments.Engineering.location').getValue() AS engineeringLocation
FROM company;

Explanation:

Dynamic Access:

The dataSet.getFieldValueWithNested function dynamically navigates through nested nodes using dot notation.
Each node (e.g., company.departments.HR) corresponds to a level in the hierarchy, with the column name (e.g., head.name) specified at the end.

Driving Dataset:

The currentRecord refers to the current row in the driving dataset (root node company in this case).
The function extracts values from the nested levels of the JSON.

Output Columns:

The query generates columns like hrHeadName, hrLocation, engineeringHeadName, and engineeringLocation based on the specified paths.

hrHeadName	hrLocation	engineeringHeadName	engineeringLocation
Alice	Building A	Bob	Building B

Selecting From Nested Node and Accessing Data From a Sibling Node Using getParent

In this example, we query data from a nested node (intervals) and access additional information from its sibling node (asset) using getParent. The parent dataset (data) provides contextual information such as the asset’s name and driver ID.

Source (JSON):

[
  {
   "tripStartTime": "2024-12-17T00:08:13.029Z",
   "createdAtTime": "2024-12-17T00:08:51.352Z",
   "updatedAtTime": "2024-12-17T00:37:16.026Z",
   "asset": {
     "id": "281474988523670",
     "name": "422660",
     "type": "Vehicle"
   },
   "intervals": [
      {
        "postedSpeedLimitKilometersPerHour": 40,
        "maxSpeedKilometersPerHour": 50.004,
        "severityLevel": "moderate",
        "startTime": "2024-12-17T00:09:41.032Z",
        "endTime": "2024-12-17T00:09:47.129Z",
        "address": {
           "street": "Anonymized St",
           "city": "Anytown",
           "state": "XY",
           "postalCode": "00000",
           "country": "US"
         }
      }
   ],
   "driverId": "51550209"
   },
   {
      "tripStartTime": "2024-12-17T00:39:09.025Z",
      "createdAtTime": "2024-12-17T00:39:28.039Z",
      "updatedAtTime": "2024-12-17T03:49:04.002Z",
      "asset": {
         "id": "281474986252047",
         "name": "419439",
         "type": "Vehicle"
      },
      "intervals": [
         {
            "postedSpeedLimitKilometersPerHour": 105,
            "maxSpeedKilometersPerHour": 113.009,
            "severityLevel": "moderate",
            "startTime": "2024-12-17T00:48:44.028Z",
            "endTime": "2024-12-17T00:48:54.03Z",
            "address": {
              "street": "Anonymized Ave",
              "city": "Somewhere",
              "state": "YZ",
              "postalCode": "11111",
              "country": "US"
             }
         },
    }
   ],
   "driverId": "50575550"
  }
]

Query:

SELECT
   startTime,
   endTime,
   severityLevel,
   maxSpeedKilometersPerHour,
   postedSpeedLimitKilometersPerHour,
   dataSet.getParentDataSet().getFieldValueWithNested(currentRecord.getParentDataSetRecord(), 'asset.name').getValue() AS tractor_id
FROM intervals;

Explanation:

Parent Data Access: getParentDataSet is used to retrieve the asset.name and driverId fields from the parent node.

Dynamic Hierarchical Navigation: The intervals node is queried, and fields from its sibling (asset) are dynamically included in the result.

Output:

startTime	endTime	severityLevel	maxSpeedKilometersPerHour	postedSpeedLimitKilometersPerHour	tractor_id
2024-12-17T00:09:41.032Z	2024-12-17T00:09:47.129Z	moderate	50.004	40	422660
2024-12-17T00:48:44.028Z	2024-12-17T00:48:54.03Z	moderate	113.009	105	419439

Missing or Null Nodes

When querying JSON data, there are scenarios where specific nodes or attributes might be missing or null. The SQL engine handles these cases gracefully by returning null for the missing or null data, ensuring no errors are thrown during execution.

Source (JSON):

[
  {
    "id": 1,
    "name": "John Doe",
    "address": {
    "street": "123 Elm St",
      "city": "Springfield",
      "state": "IL"
     }
  },
  {
    "id": 2,
    "name": "Jane Smith",
    "address": null
  },
  {
    "id": 3,
    name": "Alice Johnson"
  }
]

Query:

SELECT
id,
name,
dataSet.getNullableFieldValueWithNested(currentRecord, 'address.street').getValue() AS street,
dataSet.getNullableFieldValueWithNested(currentRecord, 'address.city').getValue() AS city,
dataSet.getNullableFieldValueWithNested(currentRecord, 'address.state').getValue() AS state

Explanation:

Handling Nulls:

The dataSet.getNullableFieldValueWithNested function is used to retrieve values from potentially missing or null nodes.
If the node or attribute is null or missing, the function returns null instead of throwing an error.

Dynamic Navigation: The address object is referenced dynamically using nested paths like address.street, address.city, and address.state.

Output:

id	name	street	city	state
1	John Doe	123 Elm St	Springfield	IL
2	Jane Smith	null	null	null
3	Alice Johnson	null	null	null

Detailed Explanation of Output:

First Record: The address node exists with all attributes populated, so street, city, and state are returned as-is.

Second Record: The address node is explicitly set to null. As a result, all columns (street, city, state) return null.

Third Record: The address node is completely missing. The query engine handles this, returning null for all missing fields.

Querying Excel Data

When querying data from Excel files, the main difference compared to querying from text files (e.g., CSV) is the need to specify the worksheet name in the FROM clause. Excel data can be referenced using the syntax [worksheet name], where the worksheet name must be enclosed in square brackets.

Source Data (Excel File: Orders.xlsx, Worksheet: Sheet1)

OrderNum	OrderDate	Amount
1001	2024-12-01	150.00
1002	2024-12-02	200.00
1003	2024-12-03	300.00

Query:

SELECT OrderNum, OrderDate, Amount
FROM [Sheet1]
WHERE Amount > 150;

Output:

OrderNum	OrderDate	Amount
1002	2024-12-02	200.00
1003	2024-12-03	300.00

Best Practices

Optimize for Large Datasets:

Use filters (WHERE) to minimize processing.

Use Parent-Child Relationships:

Access parent or ancestor records using hierarchical navigation methods.

Combine SQL with JavaScript:

Extend SQL capabilities by using JavaScript functions or available libraries.

Test for Edge Cases:

Validate queries for missing data, unexpected formats, and deeply nested structures.

Testing SQL queries in Explorer

Step 1. Select Explorer

Step 2. Expand nested data object

Step 3. Click Develop SQL

Step 4. Enter SQL in the editor

Step 5. Click Execute SQL.

Articles in this section

Use SQL to extract data from non-relational sources

Overview

How It Works

Supported Sources

SQL Capabilities

SQL Syntax

General Syntax

Special Variables

Order of Columns

Rules for Column Ordering:

Order Examples

Examples

Example 1: Select from Flat Dataset

Example 2: Select from Nested Dataset

Example 3: Access Child Data With a Single Record

Example 4: Aggregation

Example 5: Set Operations

Special and Edge Cases

Dynamic Column References

Selecting From Nested Node and Accessing Data From a Sibling Node Using getParent

Missing or Null Nodes

Querying Excel Data

Best Practices

Testing SQL queries in Explorer

Articles in this section

Overview

How It Works

Supported Sources

SQL Capabilities

SQL Syntax

General Syntax

Special Variables

Order of Columns

Rules for Column Ordering:

Order Examples

Examples

Example 1: Select from Flat Dataset

Example 2: Select from Nested Dataset

Example 3: Access Child Data With a Single Record

Example 4: Aggregation

Example 5: Set Operations

Special and Edge Cases

Dynamic Column References

Selecting From Nested Node and Accessing Data From a Sibling Node Using getParent

Missing or Null Nodes

Querying Excel Data

Best Practices

Testing SQL queries in Explorer

Related articles