Overview
One of the common ETL operations, typically performed to calculate a field's value, is a lookup
. Given the list of input parameters, the system should be able to return a field's value or the entire dataset, by querying a database or other data source. For example, if the input is a user_id
, the system should be able to return all the phones that belong to a user with a given ID.
There are multiple options for lookup:
- Extract and Lookup
- Lookup using an existing dataset and a SQL
- Lookup using an existing dataset and filter conditions
Extract and lookup
Use this type of lookup if you want to extract data from a given Connection, apply SQL, and return a field's value or a dataset.
It is recommended that you use this type of lookup if the data for a lookup is stored in a relational database.
For example, let's assume that there are two tables in two different databases: movies
and directors
. The movies
table has an ID
of the director (director_id
), but the actual attributes of the director, such as first name, last name, etc., are stored in the directors
table in a different database. The Flow must combine the movies and the directors and load data into a single table in the data warehouse.
When creating a Flow, you can use the movies
table as a source, add the calculated fields director_first_name
and director_last_name
where the value will be retrieved using a lookup
method, which takes a Connection to the database with a directors
table as an input, applies SQL query to find a director for a given director_id
, and returns the name of the director.
There are three methods for this type of lookup:
- Calculate the field's value using SQL
- Lookup which returns a field's value
- Lookup which returns a dataset
Calculate the field's value using SQL
You can use SQL directly (without JavaScript) to calculate the field's value.
This technique is available only in calculated fields. The lookup SQL is executed for each row extracted from the source, so when working with larger data sets, consider using Source query
instead.
Let's assume that the current record has a field director_id.
This example demonstrates how to lookup a director's first name
in the directors
table in the source database using given director_id
:
Step 1. Add a calculated field called first name
to the Mapping.
Step 2. Click the pen
button which opens the Field Value Function
editor, and type in or paste the select SQL statement. The SQL statement can return more than one field, but the system will either use a field with the same name as the name of the calculated field or (if there are no matching names) the first field returned by the SELECT
statement.
Note token {director_id}
which is used as a reference to the lookup field director_id
. You can use more than one lookup field in SQL.
Lookup data in the destination
The example above executes SQL in the source (FROM
) Connection. What if the data you are trying to look up is in the destination?
Use the destination.
prefix in the select statement to lookup data in the destination.
destination.select first_name from directors where id = {director_id}
Lookup data somewhere else (not in the source and not in the destination)
Use connection_name.
prefix in the select statement to lookup data using named Connection.
the Connection with the specific name must be either used in one of the transformations in the Flow or must be added as a named Connection to the nested Flow.
Example, if the Connection is named temp
:
temp.select first_name from directors where id = {director_id}
Using the in-memory database to improve performance
To improve the performance of the SQL lookup, you can load data into the in-memory temporary tables and use the Connection to the temporary database to lookup the data. Here's how you can do this:
Step 1. Create a Connection to the temporary database. Name it, for example, temp
.
Step 2. Create a source-to-destination Flow, where the destination is a temporary database, and the source is the original source for dictionaries: another database, file, API, etc.
Step 3. Add one or multiple source-to-destination transformations to preload dictionaries into the temporary database. Note that the dictionary tables will be created on the fly from the source.
Step 4. Continue using the Flow created in step 2 and add actual transformations to move data to the final destination.
Step 5. Add calculated fields and use SQL referencing temp database to lookup the data in dictionaries. Example:
temp.select first_name from directors where id = {director_id}
Extract and Lookup which returns a field's value
This type of lookup extracts data from the specific Connection, applies SQL, and returns the field's value.
Example of Extract and Lookup which returns a field's value
An example of populating a director's first name from the directors
table in the different database (not the same as the source database) for a row with a given director_id
:
value = com.toolsverse.etl.core.engine.Extractor.lookup(etlConfig, scenario,
"connection for a database with directors",
"directors",
"first_name",
"select first_name from directors where id = {ID}",
new com.toolsverse.util.TypedKeyValue("ID", dataSet.getFieldValue(currentRow,
'director_id')));
Signature of the method that performs Extract and Lookup which returns a field's value
The signature of the Object Extractor.lookup
with parameters:
Extractor.lookup(etlConfig, scenario, source-connection-name,
dataset-or-file-name,
field-name, sql, key-value-parameters)
The signature of the Object Extractor.lookup
without parameters:
Extractor.lookup(etlConfig, scenario, source-connection-name,
dataset-or-file-name,
field-name, sql)
When the source-connection
passed as a parameter, the Object Extract.lookup
is not a database Connection, the system uses the cache to improve the performance of the consecutive lookup calls with the same parameters. If you expect that data will be changing between the calls, use Object Extract.lookupNoCache
instead. It has the same parameters as Object Extract.lookup
but as the name says, it does not use the cache.
Looking at the example above:
- the
source-connection-name
isconnection for a database with directors
. - the
dataset-or-file-name
isdirectors
. - the
field-name
islast_name
. - the SQL is
select first_name from directors where id = {ID}
.
The SQL contains the token {ID}
.
- the parameter is:
new TypedKeyValue("ID", dataSet.getFieldValue(currentRow, 'director_id'))
.
ID
is a parameter name. There can be multiple comma-separated parameters; each starts by instantiating a new TypedKeyValue
value, with the name being followed by the value. Parameters are optional.
Required packages for the method that performs Extract and Lookup which returns a field's value
com.toolsverse.etl.core.engin, com.toolsverse.util
Extract and Lookup which returns a dataset
This type of lookup extracts data from the specific Connection, applies SQL, and returns the DataSet.
Example of Extract and Lookup which returns a dataset
An example of populating all images from two linked database tables for a row with a given id:
var images = com.toolsverse.etl.core.engine.Extractor.
lookup(etlConfig, scenario, "source_connection", "image",
"select filename as url,
row_number() OVER () as id
from property_pictures
inner join property_picture on
(property_pictures.nestedrowid =property_picture.nestedparentrowid)
where property_pictures.nestedparentrowid={ROWID}",
new com.toolsverse.util.TypedKeyValue("ROWID", dataSet.getFieldValue(currentRow,
'nestedrowid')));
Signature of the method that performs Extract and Lookup which returns a dataset
The signature of the DataSet Extractor.lookup
with parameters:
Extractor.lookup(etlConfig, scenario, source-connection-name,
dataset-or-file-name, sql, key-value-parameters)
The signature of the DataSet Extractor.lookup
without parameters:
Extractor.lookup(etlConfig, scenario, source-connection-name,
dataset-or-file-name, sql)
When the source-connection
passed as a parameter to the DataSet and Extract.lookup
is not a database Connection, the system uses the cache to improve the performance of the consecutive lookup calls with the same parameters. If you expect that data will be changing between the calls, use DataSet Extract.lookupNoCache
instead. It has the same parameters as DataSet Extract.lookup
but as the name says, it does not use the cache.
Looking at the example above:
- the
source-connection-name
issource_connection
. - the
dataset-or-file-name
isimage
. - the SQL is
sselect filename as URL
.
The SQL contains the token {ROWID}
.
- the parameter is:
new TypedKeyValue("ROWID", dataSet.getFieldValue(currentRow, 'nestedrowid'))
.
ROWID
is a parameter name. There can be multiple comma-separated parameters; each starts by instantiating a new TypedKeyValue
value, with the name being followed by the value. Parameters are optional.
Required packages for the method that performs Extract and Lookup which returns a dataset
com.toolsverse.etl.core.engin, com.toolsverse.util
Lookup in an existing dataset using SQL
Use this type of lookup if you want to extract data from a given data set, apply SQL, and return a field's value or a data set.
It is recommended that you use this type of lookup if the data for a lookup is stored in a nested JSON, XML document, or returned by a web service. Read how to use a memory Connection to store the data set to lookup data in.
This type of lookup is slower than the one which uses filter conditions but can be used with nested data sets.
For example, let's assume that there are two nested JSON files: movies.json
and directors.json
. The movies.json
has a director_id
of the director, but the actual attributes of the director, such as first name, last name, etc., are stored in the directors.json
. The Flow must combine the movies and the directors and load data into the single database table in the data warehouse.
When creating a Flow you can load directors.json
in memory, then use movies.json
as a source and create calculated fields director_first_name
and director_last_name
where the value will be retrieved using a lookup method which takes a data set directors.json
as an input, applies SQL query to find a director for a given director_id
, and returns the name of the director.
There are two methods for this type of lookup:
Lookup in an existing dataset which returns a field's value
Example of the Lookup in an existing dataset which returns a field's value
An example of populating a director's last name from the previously loaded directors.json
file for a row with a given director_id:
value = com.toolsverse.etl.common.CommonEtlUtils.getFieldValue(scenario.getSources().
get("DIRECTORS").getDataSet(),
"select last_name where id = {ID}", "last_name",
new com.toolsverse.util.TypedKeyValue("ID", dataSet.getFieldValue(currentRow,
'director_id')));
Signature of the method that performs Lookup in an existing dataset which returns a field's value
The signature of the Object CommonEtlUtils.getFieldValue
with parameters:
CommonEtlUtils.getFieldValue
(data-set, sql, field-name, key-value-parameters)
The signature of the Object CommonEtlUtils.getFieldValue
without parameters:
CommonEtlUtils.getFieldValue
(data-set, sql, field-name)
Looking at the example above:
- the
data-set
isscenario.getSources().get("DIRECTORS").getDataSet()
. - the SQL is
select last_name where id = {ID}
. - the
field-name
islast_name
.
The SQL contains the token {ID}
.
- the parameter is:
new TypedKeyValue("ID", dataSet.getFieldValue(currentRow, 'director_id'))
.
ID
is a parameter name. There can be multiple comma-separated parameters; each starts by instantiating a new TypedKeyValue value, with the name being followed by the value. Parameters are optional.
Required packages for the method that performs Lookup in an existing dataset which returns a field's value
com.toolsverse.etl.common, com.toolsverse.etl.core.engine,com.toolsverse.util
Lookup in an existing data set which returns a data set
Example of the Lookup in an existing data set which returns a data set
An example of populating all phones from the previously loaded phones.json
file for the given area code:
var phones = com.toolsverse.etl.common.CommonEtlUtils.
executeSql(scenario.getSources().get("PHONES").
getDataSet(),
"select * where area_code=412");
Signature of the method that performs Lookup in an existing data set which returns a data set
The signature of the DataSet CommonEtlUtils.executeSql
with parameters:
CommonEtlUtils.executeSql
(data-set, sql, key-value-parameters)
The signature of the DataSet CommonEtlUtils.executeSql
without parameters:
CommonEtlUtils.executeSql
(data-set, sql)
Looking at the example above:
- the data-set is
scenario.getSources().get("PHONES").getDataSet()
. - the SQL is
select * where area_code=412
.
Required packages for the method that performs Lookup in an existing data set which returns a data set
com.toolsverse.etl.commo, com.toolsverse.util
Lookup in an existing data set using filter
Use this type of lookup if you want to extract data from a given data set, apply filter conditions, and return a field's value or a data set.
It is recommended that you use this type of lookup if the data for a lookup is stored in a flat CSV or Excel document. Read how to use a memory Connection to store the data set to lookup data in.
This type of lookup is faster than the one which uses SQL but can be used only with flat data sets.
For example, let's assume that there are two flat CSV files: movies.csv
and directors.csv
. The movies.csv
has a director_id
of the director, but the actual attributes of the director, such as first name, last name, etc., are stored in the directors.csv
. The Flow must combine the movies and the directors and load data in the single database table.
When creating a Flow, you can load directors.csv
in memory, then use movies.csv
as a source, and create calculated fields director_first_name
and director_last_name
where the value will be retrieved using a lookup method which takes a data set directors.csv
as an input, applies query conditions to find a director for a given director_id
, and returns the name of the director.
There are two methods for this type of lookup:
Filter Lookup in an existing data set which returns a field's value
This type of lookup extracts data from the existing data set, applies filter, and returns the field's value.
Example of the Filter Lookup in an existing data set which returns a field's value
An example of populating a director's last name from the previously loaded directors.csv
file for a row with a given director_id
:
value = com.toolsverse.etl.common.CommonEtlUtils.lookup(scenario.getSources().
get("DIRECTORS").
getDataSet(),
"id=" + {director_id}, "last_name");
Signature of the method that performs Filter Lookup in an existing data set which returns a field's value
The signature of the Object CommonEtlUtils.lookup
:
CommonEtlUtils.lookup
(data-set, condition, field-name)
Looking at the example above:
- the
data-set
isscenario.getSources().get("DIRECTORS").getDataSet()
. - the
condition
isid=" + {director_id}
. - the
field-name
islast_name
.
The condition contains the token {director_id}
which is automatically substituted on the value of the director_id
field in the source (movies.csv
).
Required packages for the method that performs Filter Lookup in an existing data set which returns a field's value
com.toolsverse.etl.common, com.toolsverse.util
Lookup in an existing dataset which returns a dataset, filter output
This type of lookup extracts data from the existing data set, applies filter, and returns the DataSet.
Example of the Filter Lookup in an existing dataset which returns a dataset
An example of populating all phones from the previously loaded phones.csv
file for the given area code:
var phones = com.toolsverse.etl.common.CommonEtlUtils.
filter(scenario.getSources().get("PHONES").
getDataSet(),
"area_code=412");
Signature of the method that performs Filter Lookup in an existing dataset which returns a dataset
The signature of the DataSet CommonEtlUtils.filter
:
CommonEtlUtils.filter
(data-set, condition)
Looking at the example above:
- the data-set is
scenario.getSources().get("PHONES").getDataSet()
- the condition is
area_code = 412
Required packages for the method that performs Filter Lookup in an existing dataset which returns a dataset
com.toolsverse.etl.common, com.toolsverse.util
Comments
0 comments
Please sign in to leave a comment.