Overview
A Bulk Load is a method provided by a database management system to load multiple rows of data from a source file into a database table. Etlworks includes several Flow types optimized for bulk load.
When to use this Flow type
Usually, bulk operations are not logged, and transactional integrity is not enforced. Often, bulk operations also bypass triggers and integrity checks. This significantly improves performance when loading large amounts of data.
Use this Flow type when:
- You simply want to load CSV files that already exist in the cloud or local storage into the destination database without applying any transformations or transferring files over the network.
- You want to load CSV files created by the CDC Flow.
This Flow does not transform the data. Read how to ETL data into databases using bulk load.
Prerequisites
The destination database must support bulk load operations.
Below are examples for the most commonly used databases:
- SQL Server BULK INSERT statement
- RDS SQL Server BULK INSERT statement using Amazon S3 integration
- Postgres COPY command
- RDS Postgres COPY command using Amazon S3 integration
- MySQL LOAD DATA INFILE statement
- Amazon Aurora MySQL LOAD DATA INFILE statement using Amazon S3 integration
- Oracle Inline External Tables
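For reference, here is a minimal sketch of a bulk load command of the kind linked above, using PostgreSQL COPY syntax; the table and file path are hypothetical:

```sql
-- Minimal PostgreSQL COPY example (hypothetical table and file path).
-- The file must be readable by the database server process.
COPY public.orders
FROM '/data/orders.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
```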
How it works
- The Flow reads the names of all files matching the wildcard in the specified location in the cloud or file storage. It can also traverse subfolders.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads, one per destination table, but not more than the configurable threshold.
- The Flow generates and executes the BULK LOAD command to load files into the temporary or staging table.
- The Flow merges data in the temporary or staging table with data in the actual table, as sketched below.
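Conceptually, the generated sequence for one destination table resembles the sketch below (PostgreSQL syntax, hypothetical table and file names; the actual SQL is generated by the Flow and depends on the destination database and the configured Action):

```sql
-- 1. Create a temporary (or staging) table shaped like the destination table.
CREATE TEMP TABLE orders_stage (LIKE public.orders INCLUDING DEFAULTS);

-- 2. Bulk load the CSV file into the temporary table.
COPY orders_stage FROM '/data/orders_2024_01.csv' WITH (FORMAT csv, HEADER true);

-- 3. Merge using the default DELETE/INSERT strategy: delete matching rows, then insert.
DELETE FROM public.orders o USING orders_stage s WHERE o.order_id = s.order_id;
INSERT INTO public.orders SELECT * FROM orders_stage;

-- 4. The temporary table is dropped automatically at the end of the session.
```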
Features
The Flow supports `INSERT`, `MERGE` (`UPSERT`), and `CDC MERGE`. With `CDC MERGE`, the Flow updates the actual destination table by applying `INSERT`/`UPDATE`/`DELETE` events in the same order as they originated in the source database and were recorded in the CSV files. It does this in the most efficient way possible.
| Other Features | Description |
| --- | --- |
| Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source. |
| Delete loaded source files | The Flow can be configured to delete successfully loaded source files. |
| Load data in parallel | The Flow is able to load data into multiple destination tables in parallel. |
| Process files in subfolders | The Flow is able to process files in multiple nested subfolders. |
| Create all or selected columns as TEXT | The Flow is able to create all or selected columns as TEXT, which mitigates issues related to source schema drift. |
Process
Step 1. Create a new storage Connection
This Connection will be used as a source to access files stored in the cloud storage or file storage.
The Connection can be one of the following:
Step 2. Create a new Database Connection
This Connection will be used as a destination.
It is recommended to enable auto-commit for the destination Connection.
Step 3. Create a new CSV Format
Keep the default values for all parameters.
Step 4. Create a new Flow
This Flow will be used to load files in the file or cloud storage into the destination database.
In `Flows`, click `[+]`, type in `bulk load files`, and select the Flow.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- File or Cloud storage Connection created in step 1.
- CSV Format created in step 3.
- A wildcard filename that matches the filenames in the cloud storage. Use the `.csv` extension for uncompressed files and `.gz` for compressed files.
- Destination database Connection created in step 2.
- The fully qualified wildcard destination table name, for example `DB.SCHEMA.*`.
Step 6. Set required and optional parameters
Click `MAPPING`.
Select the `Parameters` tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- `Include files in subfolders`: if this option is enabled, the Flow will load all files that match the wildcard filename (set in `FROM`) in all subfolders under the main bucket or blob.
- `Exclude files` and `Include files`: optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in `FROM`:
  - The Flow populates the list of files matching the wildcard filename in `FROM`.
  - The Flow excludes files based on the list of exclusions set in `Exclude files`.
  - The Flow includes files based on the list of inclusions set in `Include files`.
- `Calculate destination table name`: a required parameter used to calculate (using JavaScript) the destination table name based on the source file name. The original file name, without the extension, is stored in the variable `name`, and the actual table name must be assigned to the variable `value`. For example, if the source file name is `db_schema-table_cdc_stream_uuid.csv`, the JavaScript code to calculate the destination table name would be `value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));`.
- `Maximum Number of Files to Process`: an optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there could be hundreds of thousands or millions (10^6 to 10^7) of files in the source folder(s), it is recommended to limit the number of files to process in one run to a more manageable range (10^3 to 10^5).
- `Maximum Number of Parallel Loads`: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel: one thread per table, but not more than the threshold set in this parameter.
- `Parallel`: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- `Create new destination connection for each table`: if this parameter is enabled, a new destination connection will be created to load data into each destination table. The connections will be closed when the Flow finishes loading data into all tables.
- `Purge Files if Success`: if this parameter is enabled (default), the source files will be automatically deleted after they have been successfully loaded.
Debug
- `Log each executed SQL statement`: enable this flag if you want to log each automatically generated and executed SQL statement, including the Before/After Load SQL.
Error recovery
- `Continue loading data into other tables if Error`: if this parameter is enabled (it is disabled by default) and there is an error when loading data into one of the tables, the Flow will continue loading data into the other tables. If configured, a notification will be sent to the webhook.
Load and Merge
- `Bulk Load SQL`: the Bulk Load SQL is used to load files in the cloud or file storage into the staging or temporary tables in the destination database. This is a required parameter. The following {tokens} can be used in the Bulk Load SQL:
  - {TABLE}: the table to load data into.
  - {FILE_TO_LOAD}: the name of the file to load, with path and extension.
  - {FILE}: the name of the file to load, without path.
  - {FULL_FILE_NAME}: same as {FILE_TO_LOAD}.
  - {FILE_NO_EXT}: the name of the file to load, without path and extension.
  - {EXT}: the extension of the file to load, without the '.'.

  Example for Azure SQL Server:

      BULK INSERT {TABLE}
      FROM '{FILE_TO_LOAD}'
      WITH (
          DATA_SOURCE = 'BulkLoadDataSource',
          FIELDTERMINATOR = ',',
          FORMAT = 'CSV',
          FIRSTROW = 2,
          MAXERRORS = 10
      )
- `Action`: can be `INSERT`, which inserts records from the file(s) into the destination table; `MERGE`, which merges records from the file with the records in the destination table; or `CDC MERGE`, which loads CDC events (`INSERT`/`UPDATE`/`DELETE`) in the same order as they originated in the source database. `MERGE` and `CDC MERGE` require configuring `Lookup Fields` and/or enabling the prediction option described under `Lookup Fields` below.
- `How to MERGE`: defines how the Flow merges data in the temp or staging table with the data in the actual table. The default is DELETE/INSERT: DELETE all records in the actual table that also exist in the temp table, then INSERT all records from the temp table into the actual table. If this parameter is set to MERGE, the Flow will execute native MERGE SQL if it is supported by the destination database. Note that not all databases support MERGE (UPSERT); for example, PostgreSQL and MySQL do not. See the sketch after this list.
- `Load into staging table`: by default, the Flow will attempt to create and load data into a temporary table. Not all databases support bulk load into a temp table. When this parameter is enabled (it is disabled by default), the Flow will create a staging table instead of a temporary one and will automatically drop the staging table on exit.
- `User-defined MERGE SQL`: by default, the MERGE SQL is generated automatically. Use this parameter to override the automatically generated SQL. The following {tokens} are available (see the example after this list):
  - {TABLE}: the destination table.
  - {TEMP_TABLE}: the staging or temp table.
  - {FIELDS}: the fields to INSERT/UPDATE in the destination table.
  - {INSERT_FIELDS}: the values of the fields to INSERT into the destination table.
  - {UPDATE_FIELDS}: the fields and values to UPDATE in the format field=value,field=value.
  - {UPDATE_CONDITIONS}: the condition used to match records in the actual destination and temp/staging tables.
- `Lookup Fields`: the comma-separated list of fields that uniquely identify the record in the destination table.
  - If this option is enabled and `Lookup Fields` is empty, the system will try to predict the columns that uniquely identify the record in the destination table.
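For reference, here is a sketch of what the two merge strategies can produce for a hypothetical orders table keyed by order_id. The first pair of statements shows the default DELETE/INSERT strategy; the second shows a possible user-defined MERGE SQL built from the documented {tokens} (only for destinations that support native MERGE). Table and column names are illustrative; the Flow generates the actual statements:

```sql
-- Default DELETE/INSERT strategy (illustrative PostgreSQL syntax, hypothetical names).
DELETE FROM public.orders o
USING orders_stage s
WHERE o.order_id = s.order_id;
INSERT INTO public.orders SELECT * FROM orders_stage;

-- A possible user-defined MERGE SQL using the documented {tokens}.
MERGE INTO {TABLE} t
USING {TEMP_TABLE} s
ON {UPDATE_CONDITIONS}
WHEN MATCHED THEN UPDATE SET {UPDATE_FIELDS}
WHEN NOT MATCHED THEN INSERT ({FIELDS}) VALUES ({INSERT_FIELDS});
```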
ELT
- `Set variables for Before and After SQL`: code in JavaScript to set the variables that can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable `vars`.
  - Example: `vars.put('schema', 'abc');`
  - Example of the SQL: `insert into test (schema_name) values ('{schema}')`
- `Before load SQL`: this SQL will be executed on the destination Connection before loading data.
- `Ignore errors when executing Before load SQL`: if this parameter is enabled and there is an error when `Before load SQL` is executed, the error will be ignored.
- `Before Load SQL is a script`: if this parameter is enabled, `Before Load SQL` will be executed as a SQL script.
- `After load SQL`: this SQL will be executed on the destination Connection after loading data into the destination table.
- `Ignore errors when executing After load SQL`: if this parameter is enabled and there is an error when `After load SQL` is executed, the error will be ignored.
- `After Load SQL is a script`: if this parameter is enabled, `After Load SQL` will be executed as a SQL script (see the example after this list).
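As an illustration, an `After load SQL` script (executed as a script when `After Load SQL is a script` is enabled) could refresh statistics and record an audit row. The table names are hypothetical, and the `{schema}` variable follows the JavaScript example above:

```sql
-- Hypothetical After load SQL script; {schema} is set in
-- 'Set variables for Before and After SQL' (see the example above).
ANALYZE {schema}.orders;
INSERT INTO {schema}.load_audit (table_name, loaded_at)
VALUES ('orders', CURRENT_TIMESTAMP);
```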
Handling source schema changes
- `Alter target table if the source has columns that the target table doesn't have`: if this parameter is enabled (it is disabled by default) and the source has fields that the destination table does not have, the system will add the missing fields to the destination table (see the sketch after this list).
- `Create all actual tables with only TEXT columns`: if this parameter is enabled (it is disabled by default), the Flow will create all destination tables with only TEXT columns. Enable this option to make the load more resilient to source schema drift. By default, the Flow creates the actual table using the first file as a pattern. It then alters the actual table by adding new columns, but it never modifies the data types of the existing columns, which could cause issues if there are multiple source files to load into the same destination table and some of these files have columns with different data types (compared to each other).
- `Actual tables with all TEXT columns`: a comma-separated list of regular expressions matching the table names for which the Flow should create all columns as `TEXT`. Here are the rules for the regexes:
  - Regexes are case insensitive.
  - If a regex includes a comma (`,`), it must be escaped (`/,`).
  - Java regex syntax must be used.
  - If the destination table name is configured using fully qualified (DB.SCHEMA.TABLE) or partially qualified (SCHEMA.TABLE) notation, the regex should reflect it. Example (include tables `country` and `city`): `text.*` -> `.*country,.*city`.
- `TEXT columns`: a comma-separated list of regular expressions matching the column names which should be created as `TEXT`. Here are the rules for the three parameters above:
  - If `Create all actual tables with only TEXT columns` is enabled, all columns in all new tables and all new columns in the existing tables will be created as `TEXT`.
  - If the table name matches a regex in `Actual tables with all TEXT columns` and `TEXT columns` is empty, all columns in that table will be created as `TEXT`.
  - If `TEXT columns` is not empty and the column name in any table matches any regex configured in this field, the column will be created as `TEXT`.
- `Create temp tables with only TEXT columns`: if this parameter is enabled (it is disabled by default), the Flow will create temporary tables with only TEXT columns. Enable this option to make the load more resilient to source schema drift. By default, the Flow creates the temp table using the first file as a pattern. It then alters the temp table by adding new columns, but it never modifies the data types of the existing columns, which could cause issues if there are multiple source files to load into the same destination table and some of these files have columns with different data types (compared to each other). This option is automatically enabled if `Create all actual tables with only TEXT columns` is enabled or if the table name matches a regex configured in `Actual tables with all TEXT columns`.
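To illustrate the schema-drift handling described above, the statements the Flow issues are conceptually similar to the following sketch (hypothetical names; the exact DDL is generated by the Flow and depends on the destination database):

```sql
-- A new column found in the source file is added to the destination table when
-- 'Alter target table if the source has columns that the target table doesn't have' is enabled.
ALTER TABLE public.orders ADD COLUMN coupon_code VARCHAR(255);

-- A table matching 'Actual tables with all TEXT columns' (or any new table when
-- 'Create all actual tables with only TEXT columns' is enabled) is created with TEXT columns only.
CREATE TABLE public.country (
    country_code TEXT,
    country_name TEXT,
    population   TEXT
);
```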
Step 7. Optionally, add a mapping
The mapping can be used to:
- Rename the column.
- Exclude the column.
- Change the column data type for the tables created by the flow.
Configure mapping
Select the transformation and click `MAPPING`.
Add a mapping for the column `CURRENT_DATE`.
Renaming or excluding a column in the mapping will work for all tables that have this column. If a table does not have the column, the mapping will be ignored.
Step 8. Optionally, add more source-to-destination transformations
Click `[+]` and repeat steps 5 to 6 for more locations and/or file name templates.