Using Automatic Partitioning
One effective way to enhance performance when working with large datasets is by leveraging automatic partitioning in ETL processes. Automatic partitioning allows data to be divided into smaller, manageable chunks based on specified conditions, such as date ranges or numeric ranges. This approach enables parallel processing of data, significantly improving the speed and efficiency of data extraction and loading operations. By processing each partition independently, you can optimize resource utilization, reduce query execution time, and minimize the impact on the database.
For more details on how to set up automatic partitioning and examples for different databases, see the Automatic Partitioning article.
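The Automatic Partitioning article covers the Etlworks configuration; purely for illustration, the sketch below shows the underlying idea in plain JDBC, assuming a hypothetical orders table with a numeric id key (the connection settings, chunk size, and thread count are made up for this example). Each partition is extracted by its own range query, and the partitions run in parallel.

```java
import java.sql.*;
import java.util.concurrent.*;

public class PartitionedExtract {
    // Hypothetical source connection settings; replace with your own.
    static final String URL = "jdbc:mysql://localhost:3306/sales";
    static final String USER = "etl";
    static final String PASSWORD = "secret";

    public static void main(String[] args) throws Exception {
        long minId = 1, maxId = 10_000_000, chunk = 1_000_000;
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Split the key range into partitions and extract each one in parallel.
        for (long low = minId; low <= maxId; low += chunk) {
            final long lo = low, hi = Math.min(low + chunk - 1, maxId);
            pool.submit(() -> extractPartition(lo, hi));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    static void extractPartition(long lo, long hi) {
        String sql = "SELECT * FROM orders WHERE id BETWEEN ? AND ?";
        try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // load each record of this partition into the destination here
                }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
```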
Consider setting the Fetch Size when extracting data from large datasets
Many database drivers (MySQL, Redshift, and others) do not set an upper limit on the number of records that can be retrieved in one database call. When extracting data from larger datasets, this can cause an OutOfMemory error and prevent the system from extracting and loading the data.
The Fetch Size field can be used to set the maximum number of records that will be retrieved in one database call, thus limiting the memory consumption within the JVM. You can set the Fetch Size at the Connection level or override it at the transformation level.
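Under the hood this maps to the standard JDBC Statement.setFetchSize call. A minimal sketch, assuming a hypothetical PostgreSQL source and a big_table table (the names and the value 10,000 are illustrative only):

```java
import java.sql.*;

public class FetchSizeExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/warehouse", "etl", "secret")) {
            conn.setAutoCommit(false); // PostgreSQL only honors the fetch size inside a transaction
            try (Statement stmt = conn.createStatement()) {
                stmt.setFetchSize(10_000); // retrieve at most 10,000 rows per round trip
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM big_table")) {
                    while (rs.next()) {
                        // process one row at a time without loading the full result set into memory
                    }
                }
            }
        }
    }
}
```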
Consider setting the Max Field Size when extracting data from wide tables
This parameter sets the limit for the maximum number of bytes that can be returned for character and binary column values in a ResultSet object produced by the Statement object. The limit applies only to BINARY, VARBINARY, LONGVARBINARY, CHAR, VARCHAR, NCHAR, NVARCHAR, LONGNVARCHAR, and LONGVARCHAR fields. If the limit is exceeded, the excess data is silently discarded. For maximum portability, use values greater than 256.
You can set the Max Field Size at the Connection level or override it at the transformation level.
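This setting corresponds to the standard JDBC Statement.setMaxFieldSize call. A minimal sketch, assuming a hypothetical wide_table with a long character column (the names and the 4 KB limit are illustrative only):

```java
import java.sql.*;

public class MaxFieldSizeExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/warehouse", "etl", "secret");
             Statement stmt = conn.createStatement()) {
            // Cap character/binary column values at 4 KB; anything longer is silently truncated.
            stmt.setMaxFieldSize(4096);
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT id, long_description FROM wide_table")) {
                while (rs.next()) {
                    String description = rs.getString("long_description"); // at most 4096 bytes
                    // process the (possibly truncated) value
                }
            }
        }
    }
}
```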
Use streaming and bind variables if possible, but avoid using native SQL
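To illustrate what bind variables look like at the JDBC level (the orders table and region column below are hypothetical), compare a query built by string concatenation with the same query using a parameter placeholder:

```java
import java.sql.*;

public class BindVariableExample {
    // Avoid: concatenating values into the SQL text forces the database to parse a new
    // statement for every distinct value and exposes the query to SQL injection.
    static int countWithoutBindVariables(Connection conn, String region) throws SQLException {
        String sql = "SELECT COUNT(*) FROM orders WHERE region = '" + region + "'";
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            rs.next();
            return rs.getInt(1);
        }
    }

    // Prefer: a bind variable lets the database reuse the parsed statement and execution plan.
    static int countWithBindVariables(Connection conn, String region) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM orders WHERE region = ?")) {
            ps.setString(1, region);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }
}
```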
Enable Force streaming when streaming is not supported
Read about force streaming.
Enable Auto Commit
If you are loading large datasets into the database, enable Auto Commit for the destination database Connection.
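Conceptually, auto-commit keeps the load from accumulating into one enormous transaction on the destination. The sketch below shows the idea in plain JDBC, assuming a hypothetical target_table and batched inserts (the connection settings and batch size are illustrative only):

```java
import java.sql.*;

public class AutoCommitExample {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/warehouse", "etl", "secret")) {
            // With auto-commit enabled, work is committed as it is executed instead of
            // piling up into a single huge transaction that must fit in the undo/rollback space.
            conn.setAutoCommit(true);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO target_table (id, payload) VALUES (?, ?)")) {
                for (int i = 1; i <= 1_000_000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row " + i);
                    ps.addBatch();
                    if (i % 10_000 == 0) {
                        ps.executeBatch(); // each batch is committed as it is written
                    }
                }
                ps.executeBatch(); // flush the final partial batch
            }
        }
    }
}
```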
Use BULK MERGE or BULK DELETE/INSERT
In Etlworks, you can MERGE (UPSERT) records extracted from the source with records in the destination table when working with relational databases. MERGE can be slow, so consider using BULK MERGE or BULK DELETE/INSERT when possible.
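BULK MERGE and BULK DELETE/INSERT are Etlworks features; the sketch below only illustrates the general set-based pattern they rely on, using PostgreSQL-flavored SQL and hypothetical target_table / staging_table names: stage the incoming data first, then apply it with two set-based statements instead of one MERGE per record.

```java
import java.sql.*;

public class BulkDeleteInsert {
    // Apply staged data to the destination in two set-based statements.
    static void apply(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // 1. Remove destination rows that are about to be replaced.
            stmt.executeUpdate(
                "DELETE FROM target_table t USING staging_table s WHERE t.id = s.id");
            // 2. Insert the full staged dataset in one operation.
            stmt.executeUpdate(
                "INSERT INTO target_table SELECT * FROM staging_table");
        }
    }
}
```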
Create indexes
If you are running custom SQL queries or using MERGE / UPDATE / DELETE / BULK MERGE / BULK DELETE/INSERT / IfExist SQL actions, make sure that all relevant database indexes exist in the destination table.
If the destination table is automatically created by the Etlworks flow, consider enabling the Create Indexes option, which can be found under transformation -> Configure -> Parameters.
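As an illustration only (the table, column, and index names are hypothetical, and CREATE INDEX IF NOT EXISTS is PostgreSQL-flavored), an index on the key column used by MERGE / DELETE lookups might be created like this:

```java
import java.sql.*;

public class CreateIndexExample {
    // Index the key column(s) that MERGE / DELETE conditions use to locate rows.
    static void ensureIndex(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "CREATE INDEX IF NOT EXISTS idx_target_table_id ON target_table (id)");
        }
    }
}
```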
Use Flows optimized for cloud data warehouses
Snowflake, Redshift, and other cloud data warehouses supported by Etlworks are columnar databases optimized for data retrieval operations, so loading data using per-record DML statements (INSERT / UPDATE / DELETE) can be very slow, especially for larger datasets.
Always use Flows optimized for these databases.
- Read how to efficiently load data into Snowflake.
- Read how to efficiently load data into Redshift.
- Read how to efficiently load data into Google BigQuery.
- Read how to efficiently load data into Azure Synapse Analytics.
- Read how to efficiently load data into Vertica.
- Read how to efficiently load data into Greenplum.
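For contrast, the sketch below shows why set-based loading wins: a per-record INSERT versus staging a file and loading it with a single COPY. It uses Snowflake-flavored commands and hypothetical file, stage, and table names; the optimized Flows linked above automate this kind of staging approach for you.

```java
import java.sql.*;

public class SnowflakeBulkLoad {
    // Slow: one DML statement per record.
    static void perRecordInsert(Connection conn) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO orders (id, amount) VALUES (?, ?)")) {
            ps.setInt(1, 1);
            ps.setDouble(2, 9.99);
            ps.executeUpdate();
        }
    }

    // Fast: stage a file once, then load it with a single set-based COPY.
    static void bulkCopy(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("PUT file:///tmp/orders.csv @%orders AUTO_COMPRESS=TRUE");
            stmt.execute("COPY INTO orders FROM @%orders/orders.csv.gz "
                       + "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)");
        }
    }
}
```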