Problem
A major media company receives metadata for movies and shows from partners as Excel files. The metadata includes information about the movie or show itself as well as technical data, such as the number of audio tracks, encoding, bit rates, etc. Except for a single required ID field, all columns are optional, and the list of columns differs in each spreadsheet. The objective of this project was to create two Flows: one to upload metadata into common storage and one to query the metadata via a REST API.
Requirements
- The solution must be able to upload data from Excel (*.xls) spreadsheets located on an FTP server into the common storage.
- The solution must be able to update existing data in the storage and create new data, using an ID column as a unique identifier.
- The solution should be able to handle spreadsheets with different numbers of columns.
- There should be a REST API to query stored metadata. The API endpoint must return a JSON array of elements, each containing a superset of all fields.
- The API must support a SQL-like expression as a query parameter to filter metadata based on conditions.
- If there is no query parameter, the API should return all metadata stored in MongoDB.
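The "superset of all fields" requirement means that rows coming from spreadsheets with different column sets must be normalized to a common schema before they are returned. As a minimal sketch (not Etlworks code, just an illustration of the requirement), this can be done by collecting every field name seen across all rows and filling the missing values with null:

```javascript
// Sketch: normalize rows with different columns into records that each
// contain the superset of all fields, filling missing values with null.
function toSuperset(rows) {
    // Collect every field name seen across all rows
    const allFields = new Set();
    for (const row of rows) {
        for (const key of Object.keys(row)) allFields.add(key);
    }
    // Re-emit each row with the full field set
    return rows.map(row => {
        const out = {};
        for (const field of allFields) {
            out[field] = field in row ? row[field] : null;
        }
        return out;
    });
}

// Example: two spreadsheets with different optional columns
const merged = toSuperset([
    { ID: 'm1', Title: 'Movie A', AudioTracks: 2 },
    { ID: 'm2', Title: 'Movie B', Encoding: 'h264' }
]);
console.log(JSON.stringify(merged));
```

Here every element of the returned JSON array has the same four keys (ID, Title, AudioTracks, Encoding), even though each source row supplied only some of them.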
Solution
Using Etlworks Integrator, two Flows were developed for the customer:
- A Flow that uploads Excel spreadsheets matching a wildcard file name, located on an FTP server, into a MongoDB database in the cloud. When extracting each record from the spreadsheet, the Flow creates a new document or updates an existing document in MongoDB; the ID column is used as a part of the unique document key. This Flow also moves processed Excel files into a bucket in Amazon S3.
- A Flow deployed as a user-defined PULL API to query the metadata stored in MongoDB. The API endpoint includes an optional query parameter that accepts a URL-encoded SQL query.
Data upload Flow
This Flow is used to upload metadata from the Excel spreadsheets into MongoDB.
Step 1. Create an FTP Connection for Excel files.
Step 2. Create an Amazon S3 Connection for processed Excel files.
Step 3. Create a Connection to MongoDB. Select Streaming MongoDB from the gallery.
Step 4. Create an Excel Format for source Excel spreadsheets.
Step 5. Create a JSON Format. Set the First Object Type to variable.
Step 6. Start creating a source-to-destination Flow by selecting the Flow type File to file from the gallery.
Step 7. Continue by adding a source-to-destination transformation where the source is a Connection created in step 1 and Format created in step 4, and the destination is a Connection created in step 3 and a Format created in step 5.
Step 8. Enter a wildcard file name, for example, *.xls, in the FROM field.
Step 9. Click MAPPING, select the Parameters tab, and select (check) the following parameters:
- Ignore when there is no file
- Ignore when there are no records
- Process all files
- Stream Data
Step 10. Click the Additional Transformations tab and copy the following code into the Transformation rules for each flow field:
currentRow.setUniqueId('metadata_' + dataSet.getFieldValue(currentRow, 'ID'));
This code sets the unique ID for each record extracted from the Excel worksheet. It will be later used to create or update a MongoDB document.
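To illustrate why this key matters, here is a minimal sketch (an assumption for illustration, not Etlworks internals) of how a document ID built as 'metadata_' + ID makes repeated uploads upsert instead of duplicating documents. A plain Map stands in for the MongoDB collection:

```javascript
// Sketch: create-or-update (upsert) keyed by 'metadata_' + ID.
// A plain Map stands in for the MongoDB collection.
const collection = new Map();

function upsert(row) {
    const docId = 'metadata_' + row.ID;  // same rule as the transformation code above
    const existing = collection.get(docId) || {};
    // Merge new columns over the existing document (update) or create it
    collection.set(docId, Object.assign({}, existing, row));
    return docId;
}

upsert({ ID: '42', Title: 'Movie A' });
upsert({ ID: '42', AudioTracks: 2 });    // updates the same document
console.log(collection.get('metadata_42'));
```

Because both rows share the same ID, the second upload enriches the existing document rather than creating a duplicate.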
Step 11. Create a Flow to move the loaded Excel files into Amazon S3.
Step 12. Combine the Flows in a nested Flow.
Step 13. Schedule the nested Flow created in step 12 to be executed periodically.
API
This Flow is deployed as a REST API endpoint and used to query metadata from MongoDB.
Step 1. Create a Connection to MongoDB. Select MongoDB from the gallery.
Step 2. Create a GET HTTP Listener.
Step 3. Create a JSON Format with all default settings.
Step 4. Create a new Flow by selecting the Flow type File to Web Service from the gallery.
Step 5. Add a source-to-destination transformation where:
- the FROM Connection and Format are the Connection created in step 1 and the JSON Format created for the upload Flow.
- the FROM is a wildcard name of the MongoDB document.
- the TO Connection and Format are the Listener created in step 2 and the Format created in step 3.
Step 6. Click MAPPING and enter {query} in the Source query field.
The query parameter is provided as a part of the URL for the API endpoint: https://app.etlworks.com/plugins/schedules/rest/v1/httplistener/partner/upload?query=select...
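Since the SQL expression travels inside a URL, it must be URL-encoded by the caller. As a sketch, the endpoint URL below is taken from this article, while the SQL text is a hypothetical example (the exact dialect accepted by the API is not shown here):

```javascript
// Sketch: URL-encoding a SQL-like filter expression for the 'query' parameter.
// The endpoint URL is from the article; the SQL text is a hypothetical example.
const base = 'https://app.etlworks.com/plugins/schedules/rest/v1/httplistener/partner/upload';
const sql = "select * from metadata where Encoding = 'h264'";
const url = base + '?query=' + encodeURIComponent(sql);
console.log(url);
```

Decoding the query parameter on the server side yields the original SQL expression unchanged.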
According to the requirements, if there is no query parameter, the API should return all metadata stored in MongoDB. Since the Source query is set to {query}, the query parameter is expected to be provided.
When parsing the parameters in user-defined APIs, the system automatically creates Flow variables. The values of the Flow variables are used to substitute {tokens} in the Source query. The following technique is used to create a Flow variable named query if the query parameter has not been provided as a part of the URL.
Step 7. Create a nested Flow and add the Flow created in steps 4 to 6.
Step 8. Click edit condition and copy the following code into the Condition field:
// Look up the 'query' Flow variable created from the URL parameter
var query = scenario.getVariable('query');
if (query == null) {
    // The parameter was not provided: create an empty 'query' variable so the
    // {query} token resolves to an empty string and all metadata is returned
    query = new com.toolsverse.etl.common.Variable();
    query.setName('query');
    query.setValue('');
    scenario.addVariable(query);
}
// Always allow the nested Flow to execute
value = true;
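The pattern in the condition script above, "create the variable with an empty default only when it is missing", can be sketched with plain objects (the real script uses the Etlworks scenario object and the com.toolsverse Variable class, which are not available outside Integrator):

```javascript
// Sketch of the default-value pattern using a plain object in place of the
// Etlworks 'scenario' object.
function ensureVariable(scenario, name, defaultValue) {
    if (scenario[name] == null) {
        scenario[name] = defaultValue;  // create the variable only if missing
    }
    return scenario[name];
}

const scenario = {};                     // no 'query' parameter in the URL
ensureVariable(scenario, 'query', '');
console.log(JSON.stringify(scenario));   // → {"query":""}
```

With the variable defaulted to an empty string, the {query} token substitutes to nothing, and the API falls back to returning all stored metadata.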
Step 9. Schedule the Flow created in steps 7 and 8 to be executed on demand.