Overview
In Etlworks, automatic partitioning allows you to configure flows to execute multiple data extraction and loading processes in parallel, based on dynamically generated partition conditions. This technique is especially useful for handling large datasets efficiently by dividing data into smaller, manageable chunks that can be processed concurrently.
What is Automatic Partitioning?
When configuring ETL flows, Etlworks enables the use of “Partition SQL.” Partition SQL is an SQL query that defines the conditions for partitioning data. When Partition SQL is configured, the flow automatically generates separate ETL transformations for each partition returned by the query. These separate transformations can then be executed in parallel, significantly improving processing efficiency.
Partition SQL does not need to use actual database partitions. It can be any SQL that generates start and end conditions (for example, start and end dates) for extracting data in chunks. This allows ETL operations to process data in parallel for each chunk, without being limited to predefined database partitions.
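To make this concrete, here is a minimal sketch (in Python, outside Etlworks itself) of what the engine conceptually does with a Partition SQL result: each returned row becomes a set of parameters substituted into a source-query template. The table, columns, and dates below are hypothetical examples.

```python
# Conceptual sketch: each row returned by the Partition SQL becomes one set
# of parameters substituted into the source-query template via {ColumnName}
# placeholders (hypothetical table and column names).
partitions = [
    {"StartDate": "2022-01-01", "EndDate": "2022-02-01"},
    {"StartDate": "2022-02-01", "EndDate": "2022-03-01"},
]

source_query_template = (
    "SELECT * FROM Orders "
    "WHERE OrderDate >= '{StartDate}' AND OrderDate < '{EndDate}'"
)

def build_partition_queries(template, partitions):
    """Substitute each partition's parameters into the template."""
    return [template.format(**p) for p in partitions]

queries = build_partition_queries(source_query_template, partitions)
for q in queries:
    print(q)
```

Each generated query then runs as its own extraction, one per partition row.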
Why Use Partitioning?
Automatic partitioning offers several benefits:
Enhanced Parallel Processing: By breaking the data into smaller chunks, each chunk can be processed simultaneously, speeding up the ETL operations.
Efficient Handling of Large Datasets: Instead of processing an entire large table at once, partitioning enables handling smaller, more manageable pieces.
Flexible Partitioning Strategy: The ability to define custom partition conditions through Partition SQL provides flexibility in how data is divided, allowing for more dynamic data extraction strategies.
Optimized Resource Utilization: Parallel execution of partitions can better utilize system resources by distributing workload across multiple threads or servers.
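The parallelism benefit can be sketched as follows. This is not Etlworks code; `extract_chunk` is a hypothetical stand-in for one partition-specific ETL transformation, and the thread pool plays the role of the parallel-execution option in the flow.

```python
# Sketch of parallel chunk processing: each partition's transformation
# runs concurrently on its own worker thread.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical numeric-range partitions (start_id, end_id)
partitions = [(1, 1000), (1001, 2000), (2001, 3000)]

def extract_chunk(bounds):
    start_id, end_id = bounds
    # A real transformation would execute something like:
    #   SELECT * FROM Orders WHERE OrderID BETWEEN {start_id} AND {end_id}
    return f"processed rows {start_id}..{end_id}"

# map() preserves partition order while the chunks run in parallel
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(extract_chunk, partitions))

print(results)
```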
Which Flows Support Automatic Partitioning?
Any ETL flow where the source is a relational database supports automatic partitioning.
How to Set Up Automatic Partitioning
Setting up automatic partitioning in Etlworks involves the following steps:
Step 1. Create a flow where the source is a relational database, then add a source-to-destination transformation.
Step 2. Click Configure, select the Parameters tab, and enter the Partition SQL.
Step 3. For faster processing, enable Use Parallel Threads when processing sources by a wildcard.
Configure other parameters, such as Action, Lookup fields, etc.
Step 4. Select the Mapping tab and enter the Source query.
Importance of the Source Query
When setting up automatic partitioning in an ETL flow, entering a source query for extracting data from each partition is required because it allows the system to dynamically adapt to the different conditions associated with each partition. Here’s why this step is necessary:
1. Dynamic Partition-Specific Data Extraction
Each partition represents a distinct segment of the dataset, defined by unique conditions (e.g., date ranges, numeric ranges, or partition numbers). The source query serves as a template that can be adjusted based on the partition parameters generated by the Partition SQL. For example, in a scenario where partitions are based on date ranges, the source query can be dynamically modified to only extract data within the specified date range for each partition. This ensures that only relevant data for each partition is processed, improving performance and efficiency.
2. Handling Different Partitioning Strategies
Since Partition SQL is flexible and does not have to use actual database partitions, the source query must adapt to different partitioning strategies. For instance, if the partitions are based on dynamically generated start and end dates (chunks), the source query needs to incorporate these values to filter the data correctly. By providing a source query, users can customize the data extraction logic to suit their partitioning approach, whether it’s based on database partitions or custom ranges.
3. Executing ETL Flows in Parallel
Automatic partitioning generates separate ETL transformations for each partition. The source query allows each transformation to target its specific subset of data, as defined by the partition parameters. Without a partition-specific source query, there would be no way to distinguish which data should be processed by each parallel execution, potentially leading to overlapping or missing data.
4. Ensuring Data Consistency and Accuracy
By using a source query tailored to each partition, the ETL process ensures that all data is covered without duplication or gaps. The partition parameters (e.g., {PartitionNumber}, {StartDate}, {EndDate}) can be substituted in the source query to filter data for each partition correctly, avoiding issues where data is unintentionally processed multiple times or not at all.
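The "no duplication or gaps" property is easy to check when partitions are half-open [start, end) ranges: each chunk must start exactly where the previous one ended. A small sketch of that check, with hypothetical date ranges:

```python
# Sketch: verify that half-open [start, end) partitions are contiguous,
# i.e. no gaps and no overlaps, assuming they are sorted by start date.
from datetime import date

partitions = [
    (date(2022, 1, 1), date(2022, 2, 1)),
    (date(2022, 2, 1), date(2022, 3, 1)),
    (date(2022, 3, 1), date(2022, 4, 1)),
]

def is_gapless(parts):
    """Each chunk must start exactly where the previous one ended."""
    return all(prev_end == next_start
               for (_, prev_end), (next_start, _) in zip(parts, parts[1:]))

print(is_gapless(partitions))  # prints True
```

The same idea applies to numeric ranges; with BETWEEN (closed intervals), the next chunk must start one unit after the previous end instead.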
Examples of Partition-Specific Source Queries
Based on Partition Number
For SQL Server tables partitioned with a partition function, the source query can filter data using the $PARTITION function, targeting only the relevant partition.
SELECT * FROM Orders WHERE $PARTITION.OrderDateRangePF(OrderDate) = {PartitionNumber};
Based on Date Ranges
If partitions are defined by date ranges, the source query can use these ranges to filter the data:
SELECT * FROM Orders WHERE OrderDate >= '{StartDate}' AND OrderDate < '{EndDate}';
Partitioning Examples
When using automatic partitioning in ETL flows, the “Partition SQL” generates partition-specific parameters that are passed to the source query for data extraction. Below are examples of how Partition SQL and source queries can be configured for different databases.
Before diving into the examples, it is important to understand the difference between physical partitions and logical partitioning:
Physical partitions refer to database-managed segments of a table that are divided based on a specified partitioning strategy, such as range, list, or hash. These partitions are physically separated within the database storage (e.g., different filegroups in SQL Server) and are managed by the database system itself to optimize performance and data management. When working with physical partitions, queries can explicitly target specific partitions, leading to improved efficiency, parallel processing, and simplified maintenance tasks such as archiving or data purging. Physical partitioning is defined at the schema level of the database and usually involves partitioning a table based on a specific column, like a date, ID, or other criteria.
Logical partitioning is a method of dividing data into smaller, manageable chunks based on user-defined conditions rather than the database’s built-in partitioning capabilities. These conditions can be date ranges, numeric ranges, or other criteria used to specify how the data should be split. Logical partitioning is implemented at the ETL (Extract, Transform, Load) level, allowing for custom partitioning strategies that do not rely on the underlying database’s partitioning features. This type of partitioning is ideal for scenarios where data needs to be processed in parallel or when partitioning criteria may vary. Unlike physical partitions, logical partitions do not affect the database’s storage configuration, offering flexibility in the partitioning approach.
Examples Using Physical Partitions
Oracle using Physical Partitions
Partition SQL: Retrieves partition information for a table partitioned by date.
SELECT partition_name AS PartitionName, high_value AS HighValue
FROM user_tab_partitions
WHERE table_name = 'ORDERS';
Source Query: Uses the partition name to extract data from the specified partition.
SELECT * FROM Orders PARTITION ({PartitionName});
SQL Server using Physical Partitions
Partition SQL: Retrieves the partition numbers and high boundary values for the table. The join goes through the table's partition scheme and function so that only this table's boundary values are returned.
SELECT
p.partition_number AS PartitionNumber,
prv.value AS HighBoundaryValue
FROM
sys.partitions p
JOIN
sys.indexes i ON p.object_id = i.object_id AND p.index_id = i.index_id
JOIN
sys.partition_schemes ps ON i.data_space_id = ps.data_space_id
JOIN
sys.partition_range_values prv ON ps.function_id = prv.function_id
AND p.partition_number = prv.boundary_id
JOIN
sys.tables t ON p.object_id = t.object_id
WHERE
t.name = 'Orders'
AND p.index_id IN (0, 1);
Source Query
DECLARE @PartitionNumber INT = {PartitionNumber};
DECLARE @TableName NVARCHAR(128) = 'Orders';
DECLARE @PartitionFunction NVARCHAR(128) = 'OrderDateRangePF';
DECLARE @PartitionColumn NVARCHAR(128);
DECLARE @SQL NVARCHAR(MAX);
-- Identify the partitioning column
SELECT TOP 1
@PartitionColumn = c.name
FROM
sys.index_columns ic
JOIN
sys.columns c ON ic.object_id = c.object_id AND ic.column_id = c.column_id
WHERE
ic.partition_ordinal > 0
AND OBJECT_NAME(ic.object_id) = @TableName;
-- Stop if no partition column is found
IF @PartitionColumn IS NULL
BEGIN
RAISERROR ('No partition column found for the specified table.', 16, 1);
RETURN;
END
-- Build the dynamic query
SET @SQL = N'SELECT * FROM ' + QUOTENAME(@TableName) +
N' WHERE $PARTITION.' + QUOTENAME(@PartitionFunction) +
N'(' + QUOTENAME(@PartitionColumn) + N') = ' + CAST(@PartitionNumber AS NVARCHAR(10));
-- Execute the dynamic SQL
EXEC sp_executesql @SQL;
MySQL using Physical Partitions
Partition SQL: Queries metadata about table partitions.
SELECT PARTITION_NAME AS PartitionName
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'Orders';
Source Query: Uses partition-specific conditions to retrieve data.
SELECT * FROM Orders PARTITION ({PartitionName});
PostgreSQL using Physical Partitions
Partition SQL: Lists partitions for a range-partitioned table.
SELECT inhrelid::regclass AS PartitionTableName
FROM pg_inherits
WHERE inhparent = 'Orders'::regclass;
Source Query: Queries a partition table directly.
SELECT * FROM {PartitionTableName};
DB2 using Physical Partitions
Partition SQL: Retrieves the sequence number and high boundary of each data partition in a range-partitioned table (DB2 LUW catalog).
SELECT SEQNO AS PartitionNumber, HIGHVALUE AS HighValue
FROM SYSCAT.DATAPARTITIONS
WHERE TABNAME = 'ORDERS';
Source Query: Uses the DATAPARTITIONNUM function to select only the rows stored in the given data partition (any column of the table can be passed to the function).
SELECT * FROM Orders WHERE DATAPARTITIONNUM(OrderID) = {PartitionNumber};
AS400 using Physical Partitions
Partition SQL: Retrieves partition information for a table.
SELECT PARTITION_NUMBER AS PartitionNumber, PARTITION_NAME AS PartitionName
FROM QSYS2.SYSTABLEPARTITIONS
WHERE TABLE_NAME = 'ORDERS';
Source Query: Filters on the table's partitioning column (PartitionColumn below is a placeholder for the actual column name) using the value returned by the Partition SQL.
SELECT * FROM Orders WHERE PartitionColumn = {PartitionNumber};
Informix using Physical Partitions
Partition SQL: Lists the fragments of a fragmented table together with the expression that defines each fragment. sysfragments stores the table id, so it is joined with systables.
SELECT f.partnum AS PartitionNumber, f.exprtext AS FragmentExpression
FROM sysfragments f
JOIN systables t ON f.tabid = t.tabid
WHERE t.tabname = 'Orders' AND f.fragtype = 'T';
Source Query: Informix SQL cannot address a fragment directly, so the fragment's defining expression is reused as the filter (this works for expression-based fragmentation).
SELECT * FROM Orders WHERE {FragmentExpression};
Examples Using Logical Partitioning
Oracle using Logical Partitioning
Partition SQL: Generates date ranges to partition the data.
SELECT TO_DATE('2022-01-01', 'YYYY-MM-DD') AS start_date, TO_DATE('2022-02-01', 'YYYY-MM-DD') AS end_date FROM DUAL
UNION ALL
SELECT TO_DATE('2022-02-01', 'YYYY-MM-DD'), TO_DATE('2022-03-01', 'YYYY-MM-DD') FROM DUAL
UNION ALL
SELECT TO_DATE('2022-03-01', 'YYYY-MM-DD'), TO_DATE('2022-04-01', 'YYYY-MM-DD') FROM DUAL;
Source Query: Uses the partition parameters to extract data for the specified date range.
SELECT * FROM Orders WHERE OrderDate >= {start_date} AND OrderDate < {end_date};
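Hand-writing UNION ALL rows works for a few chunks, but for many months the same half-open ranges can be generated programmatically. A sketch (Python, outside Etlworks) producing month ranges equivalent to the Partition SQL above:

```python
# Generate half-open monthly [start, end) ranges like the UNION ALL rows above.
from datetime import date

def month_ranges(start, months):
    """Return (first_of_month, first_of_next_month) pairs."""
    ranges = []
    y, m = start.year, start.month
    for _ in range(months):
        begin = date(y, m, 1)
        m += 1
        if m == 13:                 # roll over into the next year
            y, m = y + 1, 1
        ranges.append((begin, date(y, m, 1)))
    return ranges

print(month_ranges(date(2022, 1, 1), 3))
```

Half-open ranges pair naturally with the >= start / < end filter in the source query, so month boundaries are never double-counted.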
SQL Server using Logical Partitioning
Partition SQL: Generates partitions by dividing the data based on numeric ranges.
WITH NumberRanges AS (
SELECT 1 AS start_id, 1000 AS end_id
UNION ALL
SELECT 1001, 2000
UNION ALL
SELECT 2001, 3000
)
SELECT start_id, end_id FROM NumberRanges;
Source Query: Extracts data based on the range of IDs for each partition.
SELECT * FROM Orders WHERE OrderID BETWEEN {start_id} AND {end_id};
MySQL using Logical Partitioning
Partition SQL: Generates date ranges to partition the data.
SELECT '2022-01-01' AS start_date, '2022-01-31' AS end_date
UNION ALL
SELECT '2022-02-01', '2022-02-28'
UNION ALL
SELECT '2022-03-01', '2022-03-31';
Source Query: Filters the data for the specified date range.
SELECT * FROM Orders WHERE OrderDate >= '{start_date}' AND OrderDate < '{end_date}';
PostgreSQL using Logical Partitioning
Partition SQL: Uses generate_series to create date ranges.
SELECT d::date AS start_date,
(d + interval '1 month')::date AS end_date
FROM generate_series('2022-01-01'::date, '2022-03-01'::date, interval '1 month') AS g(d);
Source Query: Retrieves data for each partitioned date range.
SELECT * FROM Orders WHERE OrderDate >= '{start_date}' AND OrderDate < '{end_date}';
DB2 using Logical Partitioning
Partition SQL: Generates month-based ranges using a dummy table.
SELECT '2022-01-01' AS start_date, '2022-01-31' AS end_date FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '2022-02-01', '2022-02-28' FROM SYSIBM.SYSDUMMY1;
Source Query: Filters the data according to the specified range.
SELECT * FROM Orders WHERE OrderDate BETWEEN '{start_date}' AND '{end_date}';
AS400 using Logical Partitioning
Partition SQL: Uses dates to partition the data.
SELECT DATE('2022-01-01') AS start_date, DATE('2022-01-31') AS end_date FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT DATE('2022-02-01'), DATE('2022-02-28') FROM SYSIBM.SYSDUMMY1;
Source Query: Extracts records from the table for each specified date range.
SELECT * FROM Orders WHERE OrderDate BETWEEN '{start_date}' AND '{end_date}';
Informix using Logical Partitioning
Partition SQL: Generates numeric ranges using a system table.
SELECT 1 AS start_id, 1000 AS end_id FROM systables WHERE tabid = 1
UNION ALL
SELECT 1001, 2000 FROM systables WHERE tabid = 1;
Source Query: Retrieves data for each numeric range.
SELECT * FROM Orders WHERE OrderID BETWEEN {start_id} AND {end_id};
Using Partition Parameters in Source Queries and Transformations
When setting up automatic partitioning, users can utilize the values of columns returned by the Partition SQL anywhere in the source query or in the “To” part of the source-to-destination transformation. These columns can be referenced using the syntax {ColumnName}. For example, if the Partition SQL returns columns named StartDate and EndDate, these can be referenced in the source query to filter data based on the partition-specific date range:
SELECT * FROM Orders WHERE OrderDate >= '{StartDate}' AND OrderDate < '{EndDate}';
Additionally, users can use these partition-specific parameters in other areas of the transformation, such as naming output files dynamically:
orders_{EndDate}.csv
Users can also reference the partition index as {PartitionId}. The partition index is a 1-based number representing the row returned by the Partition SQL, making it useful for uniquely identifying or naming partitions, for example:
orders_{PartitionId}.csv
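The substitution mechanics can be sketched as follows (Python, outside Etlworks): every column returned by the Partition SQL plus the 1-based {PartitionId} is available to both the source query and the destination name. The column names and templates are hypothetical.

```python
# Sketch of {ColumnName} / {PartitionId} substitution in a source query
# and a destination file name (hypothetical columns and templates).
partition_rows = [
    {"StartDate": "2022-01-01", "EndDate": "2022-02-01"},
    {"StartDate": "2022-02-01", "EndDate": "2022-03-01"},
]

query_template = ("SELECT * FROM Orders WHERE OrderDate >= '{StartDate}' "
                  "AND OrderDate < '{EndDate}'")
file_template = "orders_{PartitionId}.csv"

rendered = []
for idx, row in enumerate(partition_rows, start=1):  # {PartitionId} is 1-based
    params = {**row, "PartitionId": idx}
    rendered.append((query_template.format(**params),
                     file_template.format(**params)))

print(rendered[0][1])  # prints orders_1.csv
```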
This flexibility allows for dynamic and customized data processing based on partition-specific criteria, optimizing ETL workflows.
Conclusion
Automatic partitioning with Etlworks simplifies the process of managing large datasets by breaking down data processing tasks into smaller, concurrent ETL flows. By using flexible Partition SQL, users can define custom partitioning strategies that suit their data management needs. Whether extracting data in chunks, loading data into specific segments, or archiving records, automatic partitioning enhances performance and scalability.