Overview
Paginated APIs are commonly used to retrieve large datasets in smaller chunks, improving performance and usability. Etlworks provides flexible options to integrate and work with paginated APIs using various pagination methods, such as offset-based, cursor-based, token-based, based on presence of data and time-based pagination.
General Steps for Working with Paginated APIs
Step 1. Create an API Connection:
• Go to Connections > Add connection > HTTP.
• Set the base URL and any required authentication parameters.
• Use variables like {offset}, {page}, or {cursor} to dynamically update API requests during pagination.
Step 2. Extract Response Metadata:
• Use Postprocessor in HTTP connection to parse API responses and extract metadata (e.g., next_cursor, total_pages, or next_page_token).
Step 3. Create ETL flow.
• Use connection created in step 1 as a source.
Step 4. Set Up Loop:
• Create Nested Flow.
• Define Flow Variables as key-value pairs. The keys must be the same as {variables} in the HTTP connection URL. Set starting values.
• Add ETL flow created in step 3 to the nested flow and configure Script Loop to iterate through pages until the end condition is met.
Step 5. Handle Rate Limiting:
• Use Throttling parameters or configure retries if the API enforces rate limits.
Pagination Methods
Etlworks supports almost any pagination method. The following are the most common:
Offset-Based Pagination
Description: Uses an offset parameter to indicate where the page starts and a limit parameter to define the number of results per page.
The other name of this pagination technique is Page-based.
Common Parameters:
• page: The index of the first record to retrieve (e.g., 1 for the first page). The other common name for this parameter is offset.
• size: The maximum number of records to retrieve. The other common names for this parameter are pageSize and limit.
Example:
GET /api/items?page=1&size=10
GET /api/items?page=2&size=10
Stop conditions:
• The loop stops when there is no more data to fetch (e.g., an empty response or no data node in the API response), or
• When the loop reaches the total number of iterations required to fetch all the data, based on the total number of records provided in the metadata.
Offset-Based Pagination: Implementation
Etlworks provides flexibility for implementing Offset-Based Pagination by handling two common stop conditions:
Case 1: Stop when there is no more data to fetch.
Case 2: Stop when the loop reaches the total number of iterations based on the total number of records.
Below are step-by-step instructions for implementing both cases.
Step 1: Create an HTTP Connection
• Go to Connections > Add Connection > HTTP.
2. Set Up the Connection:
• URL: Enter the API’s base URL (e.g., https://api.example.com).
• Use variables for pagination parameters, such as {page} and {size}:
/api/items?page={page}&size={size}
• Configure the authentication and set other parameters.
Step 2: Extract Response Metadata
• Add a script to parse the API response and extract pagination-related metadata to the Postprocessor field. Example:
// skip auth requests
if (!isAuthRequest) {
// parse the API response
var responseData = JSON.parse(response);
// get etl config
var etlConfig = com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext() ? com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext().getData() : null;
// adjust based on API response structure
var records = responseData.data;
if (etlConfig && (!records || records.length === 0)) {
// request to stop the loop
etlConfig.setRequestStop(true);
}
}
// return unmodified response, modify if needed
value = response;
• Optional: Extract Total Records for Iteration Limit:
// add this block before the last line, which must be value = ....
// Adjust based on API's response structure
var metadata = responseData.metadata;
if (metadata && metadata.total_records && !etlConfig.getValue("total_records")) {
// Set total records
etlConfig.setValue("total_records", metadata.total_records);
}
• Save the HTTP connection.
Step 3: Create the ETL Flow
• Go to Flows > Add flow > Select flow where the Source is Web Service.
• Name this flow.
• Configure source-to-destination transformation where the source is HTTP connection created in step 1.
• When configuring source-to-destination transformation enable Configure > Parameters > Ignore when there are no records. By enabling this flag you will avoid an exception if the response from API contains no data.
• Save the flow.
Step 4: Setup Loop
1. Create a Nested Flow:
• Go to Flows > Add flow > Nested Flow.
• Name this flow.
2. Add the ETL Flow:
• Add the ETL flow created in Step 3 as a step in this nested flow.
3. Define Flow Variables:
• Go to the Parameters tab in the nested flow.
• Under Flow Variables, define key-value pairs and starting values. The keys must be the same as {variables} in the HTTP connection URL. Only set the size variable. We will be setting the page dynamically.
Example:
size |
10 |
4. Configure the Loop:
• Add a Script Loop to dynamically update the pagination parameters. The scrip loop ends when the loop function returns null.
Case 1: Stop When No More Data
Use this loop script when the API stops returning data:
// get the variables for the loop
var vars = com.toolsverse.config.SystemConfig.instance().getProperties();
var page = vars.get("page");
var currentPage;
// if page is not already set
if (!page) {
currentPage = 1;
} else { // otherwise increment the page
currentPage = parseInt(page, 10) + 1;
}
// Set the page variable as string
vars.put("page", currentPage.toString());
// stop the loop if request to stop is set in the Preprocessor
value = etlConfig.isRequestStop() ? null : "continue";
Case 2: Stop After Total Records
Use this loop script when the total number of records is provided:
// get the total records set in Postprocessor
var totalRecords = etlConfig.getValue("total_records");
// stop the loop immediately if totalRecords is not set or 0
if (totalRecords == 0) {
value = 0;
} else { // otherwise continue
// get the variables for the loop
var vars = com.toolsverse.config.SystemConfig.instance().getProperties();
// the size of the page
var size = parseInt(vars.get("size"), 10);
// calculate total number of pages
var totalPages = Math.ceil(totalRecords / size);
var page = vars.get("page");
var currentPage;
// if page is not already set
if (!page) {
currentPage = 1;
} else { // otherwise increment the page
currentPage = parseInt(page, 10) + 1;
}
// Set the page variable
vars.put("page", currentPage.toString());
// stop the loop if currentPage > totalPages
value = currentPage > totalPages ? null : "continue";
}
Cursor-Based Pagination
Description: Uses a cursor (a unique identifier or token) to determine the starting point of the next page.
Common Parameters:
• cursor: A token indicating the starting point for the next page.
• limit: The number of records to retrieve per page.
Example:
GET /api/items?cursor=abc123&limit=10
The API response includes the next_cursor to be used in subsequent requests:
{
"data": [...],
"next_cursor": "xyz789"
}
Stop conditions:
• The loop stops when the next_cursor is empty or there is no node next_cursor in the API response.
Cursor-Based Pagination: Implementation
Etlworks provides flexibility for implementing Cursor-Based Pagination by handling APIs that use a cursor or token to determine the starting point of the next page. Below are step-by-step instructions for implementation.
Step 1: Create an HTTP Connection
• Go to Connections > Add Connection > HTTP.
2. Set Up the Connection:
• URL: Enter the API’s base URL (e.g., https://api.example.com).
• Use variables for pagination parameters, such as {cursor} and {limit}.
/api/items?{cursor}&limit={limit}
Note that since there is no initial value for {cursor} it is set as {cursor} in URL instead of cursor={cursor}. It will allow us to set the empty value for the {cursor} variable.
• Configure the authentication and any other required parameters.
Step 2: Extract Response Metadata
• Add a script to parse the API response and extract pagination-related metadata to the Postprocessor field. Example:
// skip auth requests
if (!isAuthRequest) {
// Parse the API response
var responseData = JSON.parse(response);
// get etl config
var etlConfig = com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext() ? com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext().getData() : null;
// Extract the cursor for the next page
// Adjust based on API's response structure
var nextCursor = responseData.next_cursor;
if (!nextCursor) {
// Request to stop the loop if no next cursor is available
if (etlConfig) {
etlConfig.setRequestStop(true);
}
} else {
// Set the next cursor in the flow variables
if (etlConfig) {
etlConfig.setValue("cursor", nextCursor);
}
}
}
// Return unmodified response to the ETL flow, modify if needed
value = response;
• Save the HTTP connection.
Step 3: Create the ETL Flow
• Go to Flows > Add flow > Select flow where the Source is Web Service.
• Name this flow.
• Configure source-to-destination transformation where the source is HTTP connection created in step 1.
• When configuring source-to-destination transformation enable Configure > Parameters > Ignore when there are no records. By enabling this flag you will avoid an exception if the response from API contains no data.
• Save the flow.
Step 4: Setup Loop
1. Create a Nested Flow:
• Go to Flows > Add flow > Nested Flow.
• Name this flow.
2. Add the ETL Flow:
• Add the ETL flow created in Step 3 as a step in this nested flow.
3. Define Flow Variables:
• Go to the Parameters tab in the nested flow.
• Under Flow Variables, define key-value pairs and starting values. The keys must be the same as {variables} in the HTTP connection URL. Only set the size variable. We will be setting the cursor dynamically.
Example:
size |
10 |
4. Configure the Loop:
• Add a Script Loop to dynamically update the pagination parameters. The scrip loop ends when the loop function returns null.
Script Loop Example
Use this loop script when the API stops returning a cursor:
// Get the variables for the loop
var vars = com.toolsverse.config.SystemConfig.instance().getProperties();
// Get the next cursor set by parsing the response in Postprocessor
var cursor = etlConfig.getValue("cursor");
if (cursor) {
// set the value of the variable {cursor}
vars.put("cursor", "cursor=" + cursor);
} else {
// set to empty
vars.put("cursor", "");
}
// Stop the loop if the Postprocessor sets a request stop
value = etlConfig.isRequestStop() ? null : "continue";
Token-Based Pagination
Description: Uses a next_page_token or similar field provided in the API response to fetch the next set of results.
Common Parameters:
• token or page_token: A token for the next page.
Example:
GET /api/items?page_token=abc123
The API response includes the next_page_token:
{
"data": [...],
"next_page_token": "xyz789"
}
Stoping conditions:
• The loop stops when the next_page_token is empty or there is no node next_page_token in the API response.
Token-Based Pagination: Implementation
The implementation is exactly the same as for Cursor-Based pagination except there is typically no size parameter.
Pagination Based on Presence of the data
This type of pagination relies on the presence and contents of the data node in the API response. The process stops when either:
1. The data node is empty: [].
2. The data node is absent in the response.
This method is often used when the API does not explicitly return pagination metadata like cursor, page, or token.
Typical API Response:
{
"optional_metadata": 21423,
"data": [
{"key": 1, "value": "value1"},
{"key": 2, "value": "value2"}
].
}
Stopping Conditions is empty data not no data node:
{
"optional_metadata": 21423,
"data": []
}
Pagination Based on Presence of the data Node: Implementation
Etlworks supports pagination based on the presence and contents of the data node in the API response. This approach is commonly used when the API does not provide explicit pagination metadata, such as cursor, page, or token. Below are step-by-step instructions for implementing this pagination technique.
Step 1: Create an HTTP Connection
• Go to Connections > Add Connection > HTTP.
2. Set Up the Connection:
• URL: Enter the API’s base URL (e.g., https://api.example.com).
• Configure the authentication and any other required parameters.
Step 2: Extract and Parse Response
• Add a script to parse the API response to the Postprocessor field. Example:
// skip auth requests
if (!isAuthRequest) {
// Parse the API response
var responseData = JSON.parse(response);
// get etl config
var etlConfig = com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext() ? com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext().getData() : null;
// Check if the `data` node exists and is not empty
// Adjust based on API's response structure
var records = responseData.data;
if (etlConfig && (!records || records.length === 0)) {
// Request to stop the loop if `data` is absent or empty
etlConfig.setRequestStop(true);
}
}
// Must be in the last line of the code
// Return unmodified response to the ETL flow, modify if needed
value = response;
• Save the HTTP connection.
Step 3: Create the ETL Flow
• Go to Flows > Add flow > Select flow where the Source is Web Service.
• Name this flow.
• Configure source-to-destination transformation where the source is HTTP connection created in step 1.
• When configuring source-to-destination transformation enable Configure > Parameters > Ignore when there are no records. By enabling this flag you will avoid an exception if the response from API contains no data.
• Save the flow.
Step 4: Setup Loop
1. Create a Nested Flow:
• Go to Flows > Add flow > Nested Flow.
• Name this flow.
2. Add the ETL Flow:
• Add the ETL flow created in Step 3 as a step in this nested flow.
3. Configure the Loop:
• Add a Script Loop . The scrip loop ends when the loop function returns null.
Script Loop Example
Use this loop script when the API stops returning a cursor:
// Stop the loop if the Postprocessor sets a request to stop
value = etlConfig.isRequestStop() ? null : "continue";
Time-Based Pagination
Description: Uses a timestamp or other time-based parameter to paginate through data, often for logs or real-time data.
Common Parameters:
• start_time: The starting timestamp.
• end_time: The ending timestamp.
• limit: The number of records per page.
Example:
GET /api/logs?start_time=2025-01-01T00:00:00Z&size=10
Time-Based Pagination: Implementation
Time-based pagination uses a timestamp or time-based parameter to paginate through data, often for logs, real-time data, or systems that append data over time. This method works by fetching data within specific time ranges and iteratively updating the time range to retrieve subsequent records.
Below are step-by-step instructions for implementing time-based pagination in Etlworks.
Step 1: Create an HTTP Connection
• Go to Connections > Add Connection > HTTP.
2. Set Up the Connection:
• URL: Enter the API’s base URL (e.g., https://api.example.com).
• Use variables for pagination parameters, such as {start_time}, {end_time}, and {limit}:
/api/logs?start_time={start_time}&limit={limit}
• Configure the authentication and any other required parameters.
Step 2: Extract and Parse Response
• Add a script to parse the API response to the Postprocessor field. Example:
// skip auth requests
if (!isAuthRequest) {
// Parse the API response
var responseData = JSON.parse(response);
// get etl config
var etlConfig = com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext() ? com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext().getData() : null;
// Extract the data array and find the last timestamp
// Adjust based on API's response structure
var records = responseData.data;
if (etlConfig) {
if (!records || records.length === 0) {
// Request to stop the loop if no more data is available
etlConfig.setRequestStop(true);
} else {
// Extract the last timestamp from the data
var lastRecord = records[records.length - 1];
var lastTimestamp = lastRecord.timestamp; // Adjust based on API's structure
etlConfig.setValue("start_time", lastTimestamp);
}
}
}
// Must be in the last line of the code
// Return unmodified response to the ETL flow, modify if needed
value = response;
• Save the HTTP connection.
Step 3: Create the ETL Flow
• Go to Flows > Add flow > Select flow where the Source is Web Service.
• Name this flow.
• Configure source-to-destination transformation where the source is HTTP connection created in step 1.
• When configuring source-to-destination transformation enable Configure > Parameters > Ignore when there are no records. By enabling this flag you will avoid an exception if the response from API contains no data.
• Save the flow.
Step 4: Setup Loop
1. Create a Nested Flow:
• Go to Flows > Add flow > Nested Flow.
• Name this flow.
2. Add the ETL Flow:
• Add the ETL flow created in Step 3 as a step in this nested flow.
3. Define Flow Variables:
• Go to the Parameters tab in the nested flow.
• Under Flow Variables, define key-value pairs and starting values. The keys must be the same as {variables} in the HTTP connection URL.
Example:
size |
10 |
start_time |
2025-01-01T00:00:00Z |
4. Configure the Loop:
• Add a Script Loop to dynamically update the pagination parameters. The scrip loop ends when the loop function returns null.
Script Loop Example
Use this loop script when the API stops returning data:
// Stop the loop if the Postprocessor sets a request stop
value = etlConfig.isRequestStop() ? null : "continue";
If you want to dynamically calculate the initial start_time use the following script. Example:
// Get the current timestamp
var now = new Date();
// Subtract 3 days
var threeDaysAgo = new Date(now.getTime() - (3 * 24 * 60 * 60 * 1000));
var startTime = threeDaysAgo.toISOString(); // Convert to ISO 8601 format
// Get the variables for the loop
var vars = com.toolsverse.config.SystemConfig.instance().getProperties();
vars.put("start_time", startTime); // Set the initial start_time
Stop the pagination loop
While the examples in this article illustrate common approaches to handling paginated APIs using script loops, it’s important to note that pagination logic can vary significantly depending on the use case. For instance:
• You might use a SQL loop that dynamically populates pagination parameters such as page and size.
• You could implement a custom script loop tailored to specific API behaviors or response formats.
• In some cases, a file-based loop might drive the pagination, with each file representing a page or set of parameters.
Regardless of the loop type, Etlworks provides a universal mechanism to stop the pagination loop dynamically. By leveraging the Postprocessor in the HTTP connection, you can evaluate the API response and, if necessary, stop the loop. For example, if the data node is empty or missing, you can set the loop to stop using the following line of code:
// skip auth requests
if (!isAuthRequest) {
// get etl config
var etlConfig = com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext() ? com.toolsverse.config.SystemConfig.instance().
getEtlThreadContext().getData() : null;
if (etlConfig) {
// request the loop to stop
etlConfig.setRequestStop(true);
}
}
This method can be used not only for cases where there is no data but also for other conditions, such as specific error codes or unexpected responses.
Note: When the API response contains no data, you must also enable the Configure > Parameters > Ignore when there are no records option in the source-to-destination transformation. This ensures that the flow does not throw an exception when handling an empty response.
By combining flexible loop configurations with the ability to dynamically stop execution, Etlworks enables you to handle diverse and complex pagination scenarios seamlessly.
Handling Throttling and Rate Limits
When implementing pagination, making multiple HTTP calls in a loop to fetch paginated data can potentially lead to throttling or exceed the rate limits imposed by API providers. Many APIs enforce strict rules on the number of requests allowed within a specific time frame, and exceeding these limits can result in errors, delays, or even temporary bans.
To address this, Etlworks provides configurable parameters in the HTTP connection that help manage the frequency of requests and prevent throttling:
• Wait after N calls (ms): This parameter defines how long the system should pause after making a specified number of calls to the same endpoint. For example, if set to 5000 (5 seconds), the system will pause for 5 seconds after reaching the limit set by “Number of calls.”
• Number of calls: This parameter works in tandem with “Wait after N calls (ms)” and specifies the maximum number of requests that can be made to an endpoint before the system pauses. For instance, if set to 100, the system will make 100 calls, then wait for the duration specified in “Wait after N calls (ms).”
• Endpoint: By default, the endpoint is derived from the URI authority (e.g., https://myapi.com/data resolves to myapi.com). You can override this if you need to group requests differently or manage throttling for specific subdomains or paths.
Comments
0 comments
Please sign in to leave a comment.