When to use this Flow type
If you already have CSV files in an attached file system (server storage) and don't need to transform the data, the most efficient way to load these files into Oracle is to use the Flow type Bulk load files into Oracle using sql*loader, described in this article.
Use this Flow type:
- When you simply want to load CSV files that already exist in the server storage without applying any transformations or transferring files over the network.
- When you want to load CSV files created by the CDC Flow.
Flows optimized for bulk-load into Oracle
| Flow type | | When to use |
| --- | --- | --- |
| Bulk load data into Oracle | | When you need to extract data from any source, transform it, and load it into Oracle using the sql*loader utility. |
| Bulk load files into Oracle | You are here | When you need to bulk-load files that already exist in server storage without applying any transformations. The Flow automatically loads data into staging tables and MERGEs data into the destination. |
How it works
- The Flow reads the names of all files matching the wildcard in the specified server storage location. It can traverse subfolders as well.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads, one per destination table, but not more than the configurable threshold.
- The Flow creates a control file and executes the Oracle sql*loader utility to load data into a staging table in Oracle (one staging table per actual destination table).
- The Flow merges data in the staging tables with data in the actual tables.
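Under the hood, each per-table load is conceptually equivalent to invoking the sql*loader utility with a generated control file, along these lines (a simplified sketch; the actual command, credentials, and file names are generated by the Flow):

```
sqlldr userid=etl_user/secret@ORCL control=orders.ctl log=orders.log bad=orders.bad errors=100
```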
Features
The Flow supports INSERT, MERGE (UPSERT), and CDC MERGE. With CDC MERGE, the Flow updates the actual destination table by applying INSERT/UPDATE/DELETE events in the same order as they originated in the source database, in the most efficient way possible.
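For example, if the source emits INSERT (id=1), then UPDATE (id=1), then DELETE (id=1) for the same row, CDC MERGE applies the three events in exactly that order, so the destination table ends up matching the source.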
| Other Features | Description |
| --- | --- |
| Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source. |
| Load CSV files | The Flow can load CSV files into Oracle using the sql*loader utility. |
| Delete loaded source files | The Flow can be configured to delete successfully loaded source files. |
| Load data in parallel | The Flow is able to load data into multiple destination tables in parallel. |
| Process files in subfolders | The Flow is able to process files in multiple nested subfolders. |
Prerequisites
The sql*loader utility must be installed on the same VM as Etlworks. Contact Etlworks support at support@etlworks.com if you need assistance installing sql*loader.
Install and configure sql*loader
Read how to install and configure the Oracle client, which includes the sql*loader utility.
Process
Step 1. Create a new Server Storage Connection
This Connection will be used as a source to access the data files. Read more about the Server Storage connector.
Step 2. Create a new Oracle Connection
The Oracle Connection will be used as a destination. We recommend using the Oracle Service Name connector.
Step 3. Create a new CSV Format
This Format will be used to generate a control file for sql*loader.
When configuring the CSV Format, set the date/time format for dates and timestamps stored in the file. The most commonly used formats are yyyy-MM-dd HH:mm:ss and yyyy-MM-dd HH:mm:ss.SSS.
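For example, a file whose timestamps use the first pattern might look like this (illustrative data):

```
id,first_name,updated_at
1,Alice,2024-05-01 10:15:30
2,Bob,2024-05-01 10:16:02
```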
Other parameters depend on the actual format of the CSV files that you want to load. Here are the most common settings:
Step 4. Create a new Flow
This Flow will be used to load files in server storage into Oracle.
In Flows, click [+], type in bulk load files into Oracle, and select the Flow Bulk load files into Oracle using sql*loader.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- The Server Storage Connection created in Step 1.
- The CSV Format created in Step 3.
- A wildcard filename that matches the filenames in the server storage.
- The Oracle Connection created in Step 2.
- The wildcard destination table name in the following format: schema.prefix_*_suffix, where schema is an Oracle schema to load data into. Example: TEST.*. You don't need to use wildcards when loading into a specific table.
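For example (illustrative), with the destination set to TEST.* and a source file named orders.csv, the data is loaded into the table TEST.orders: the * is replaced with the table name calculated from the file name.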
Step 6. Set optional and required parameters
Click Configure.
Select the Parameters tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- Include files in subfolders: if this option is enabled, the Flow will load all files that match the wildcard filename (set in FROM) in all subfolders under the root folder.
- Exclude files and Include files: optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in FROM:
  - The Flow populates the list of files matching the wildcard filename in FROM.
  - The Flow excludes files based on the list of exclusions set in Exclude files.
  - The Flow includes files based on the list of inclusions set in Include files.
- Calculate Oracle table name: an optional parameter used to calculate (using JavaScript) the destination table name based on the source file name. The original file name, without the extension, is stored in the variable name, and the actual table name must be assigned to the variable value. For example, assuming the source file name is db_schema-table_cdc_stream_uuid.csv, the JavaScript code to calculate the destination table name would be value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream')); (see the commented sketch after this list). IMPORTANT: the Flow will automatically recognize and transform filenames matching the template *_cdc_stream_*.*. If filenames match this template, there is no need to provide a transformation script.
- Maximum Number of Files to Process: an optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there will be hundreds of thousands or millions (10^6 to 10^7) of files in the source folder(s), it is recommended to limit the number of files to process to a more manageable range: hundreds to thousands (10^3 to 10^5).
- Maximum Number of Parallel Loads: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel, one thread per table, but not more than the threshold set in this parameter.
- Parallel: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- Purge File if Success: if this option is enabled (it is by default), the source files will be automatically deleted after they are successfully loaded.
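Here is a commented version of the table-name script from the Calculate Oracle table name parameter above; it is a sketch only, using the example file name from the description:

```javascript
// 'name' holds the source file name without the extension,
// e.g. db_schema-table_cdc_stream_uuid
// The script must assign the destination table name to 'value'.
// Take everything between the first '-' and the '_cdc_stream' suffix:
value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));
// For the example input, value is now 'table'
```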
Debug
- Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including the before/after LOAD SQL.
Error recovery
- Continue loading data into other tables if Error: if this parameter is enabled (it is disabled by default) and there is an error when loading data into one of the tables, the Flow will continue loading data into the other tables. If configured, a notification will be sent to the webhook.
Load parameters
- Action: can be INSERT/APPEND/REPLACE/TRUNCATE, which inserts records from the file(s) into the Oracle table; MERGE, which merges records from the file with the records in the Oracle table; or CDC MERGE, which loads CDC events (INSERT/UPDATE/DELETE) in the same order as they originated in the source database. MERGE and CDC MERGE require configuring the Lookup Fields and/or enabling the Predict Lookup Fields option.
- How to MERGE: the type of SQL to execute when merging data. The following options are available: DELETE/INSERT - DELETE all records in the actual table that also exist in the temp table, then INSERT all records from the temp table into the actual table; Native MERGE - execute a native MERGE SQL statement.
- User-defined MERGE SQL: this SQL is used to MERGE new records and updates into the existing table. If nothing is entered in this field, the default automatically generated MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL (see the sketch after this list):
  - {TABLE}: the table to merge data into.
  - {TEMP_TABLE}: the table to merge data from.
  - {KEY_FIELDS}: the fields uniquely identifying the records in both tables.
  - {FIELDS}: the fields to insert/update in the table to merge data into.
  - {INSERT_FIELDS}: the values of the fields to insert.
  - {UPDATE_FIELDS}: the fields and values to update in the format field=value,field=value.
  - {UPDATE_CONDITIONS}: the condition for matching records in the actual and temp/staging tables.
- Lookup Fields: the comma-separated list of fields that uniquely identify the record in the target Oracle table.
- Predict Lookup Fields: if this option is enabled and Lookup Fields is empty, the system will try to predict the columns that uniquely identify the record in the target Oracle table.
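For illustration, a user-defined MERGE SQL built from these tokens could follow the shape of a native Oracle MERGE statement (a sketch; the SQL that the Flow generates by default may differ):

```sql
MERGE INTO {TABLE} t
USING {TEMP_TABLE} s
  ON ({UPDATE_CONDITIONS})
WHEN MATCHED THEN
  UPDATE SET {UPDATE_FIELDS}
WHEN NOT MATCHED THEN
  INSERT ({FIELDS}) VALUES ({INSERT_FIELDS})
```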
Hardcoded sql*loader parameters
- Content of the Control File: optional content of the sql*loader control file. By default, the Flow will generate a control file automatically. The following tokens can be used as part of the control file (see the template sketch after this list):
  - {FILE_TO_LOAD}: the data file name.
  - {TABLE}: the destination table name.
  - {MODE}: the load mode (INSERT/APPEND/REPLACE/TRUNCATE).
  - {COLUMNS}: the destination table columns.
  - {DELIMITER}: the column delimiter in the CSV file.
  - {QUOTE}: the character used as the quote in the CSV file.
  - {UPDATE_COLUMNS}: the columns to update.
  - {DIRECT}: DIRECT or CONVENTIONAL load.
  - {SKIP}: the number of rows to skip (0 if the file has no column header, otherwise 1).
  - {ERRORS}: the number of errors to tolerate.
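For reference, a minimal control-file template using these tokens might look like the following sketch (how the Flow expands each token, for example whether {DIRECT} becomes DIRECT=TRUE in the OPTIONS clause, is an assumption):

```
OPTIONS (SKIP={SKIP}, ERRORS={ERRORS}, DIRECT={DIRECT})
LOAD DATA
INFILE '{FILE_TO_LOAD}'
{MODE} INTO TABLE {TABLE}
FIELDS TERMINATED BY '{DELIMITER}' OPTIONALLY ENCLOSED BY '{QUOTE}'
TRAILING NULLCOLS
(
{COLUMNS}
)
```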
Parameters for automatically generated sql*loader control file
- Load Path: Direct or Conventional. Conventional Path loading uses the SQL INSERT statement to load data into the database; it processes data through the SQL engine and follows the usual data validation, trigger, and constraint mechanisms. Direct Path loading bypasses the SQL engine and loads data directly into the Oracle data files; this method is designed for high-speed data loading and is especially useful for large data sets.
- Max # of Errors to Ignore: the maximum number of errors to ignore during the load.
- NULLIF clause: in SQL*Loader, the NULLIF clause specifies that a field should be treated as NULL if it matches a certain condition. This is particularly useful for loading data where specific values (such as 'null' or empty strings) should be interpreted as NULL in the database. The format for this parameter is column=condition;column=condition. Example: first_name = 'null';last_name = 'null'. Column names can be enclosed in double quotes.
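In the generated control file, the example above would typically translate into per-column NULLIF clauses along these lines (a sketch):

```
first_name CHAR NULLIF (first_name = 'null'),
last_name CHAR NULLIF (last_name = 'null')
```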
ELT
- Set variables for Before and After SQL: code in JavaScript to set the variables that can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable vars.
  - Example: vars.put('schema', 'abc');
  - Example of the SQL: insert into test (schema_name) values ('{schema}')
- Before LOAD SQL: this SQL will be executed on the Oracle Connection before loading data. You can configure the Flow to ignore errors when executing this SQL and to execute the SQL as a script.
- Ignore errors when executing Before LOAD SQL: if this option is enabled and there is an error when Before LOAD SQL is executed, the error will be ignored.
- Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
- After LOAD SQL: this SQL will be executed on the Oracle Connection after loading data. You can configure the Flow to ignore errors when executing this SQL and to execute the SQL as a script.
- Ignore errors when executing After LOAD SQL: if this option is enabled and there is an error when After LOAD SQL is executed, the error will be ignored.
- After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
Handling source schema changes
- Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has fields that the target table does not, the system will add the extra fields to the target table.
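Conceptually, if the source gains a column DISCOUNT_CODE that the target table TEST.ORDERS lacks, the Flow would issue a statement like the following (illustrative names; the generated data type depends on the source column):

```sql
ALTER TABLE TEST.ORDERS ADD (DISCOUNT_CODE VARCHAR2(255))
```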
Step 7. Optionally, add a mapping
The mapping can be used to:
- Rename a column.
- Exclude a column.
- Change the column data type for tables automatically created by the Flow.
Configure mapping
Select the transformation and click Configure.
Add a mapping for the column CURRENT_DATE.
Renaming or excluding a column in the mapping works for all tables that have this column. If a table does not have the column, the mapping is ignored.
Step 8. Optionally, add more source-to-destination transformations
Click [+] and repeat steps 5 to 6 for more locations and/or file name templates.