Overview
Most (but not all) APIs limit the amount of data that can be returned or uploaded in a single request. They provide a pagination mechanism: the response includes information on how to access the next “page” (hence the name pagination).
In Etlworks, retrieving data from a paginated API can be implemented using a nested Flow with a database loop, together with a technique called dynamic URLs.
Common pagination patterns
- APIs where the amount of data is controlled by parameters in the request, and the loop stops when there is no more data to return. In this tutorial, we will use the Jira API to illustrate this pattern.
- APIs where the server controls the amount of data by providing a next marker in the response, so the loop stops when there is no next marker. We will use the API provided by an organization called UKRI Gateway to illustrate this pattern.
Process
API with the page number
The URL for a typical Jira API endpoint looks like this:
https://company.atlassian.net/rest/api/3/search?jql=query
&startAt=pointer&maxResults=10
When called, it returns the following JSON:
{"expand": "schema,names", "startAt": 0, "maxResults": 10, "total": 2823, "results": [...]}
As you can see, the request contains the starting pointer and the maximum number of results. The response includes the total number of results.
So, the algorithm to iterate through the pages looks like this:
- Make the first call with startAt=0 and maxResults=1.
- Extract total from the response and calculate the total number of calls that need to be made to the endpoint with maxResults=100 (100 is the maximum allowed number of results).
- Create a loop to iterate through the pages. The loop ends when the number of calls made to the endpoint reaches the total number of calls calculated in step 2.
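The steps above can be sketched in plain JavaScript. This is an illustration only: the hypothetical fetchPage() function stands in for the real HTTP call that Etlworks makes to the Jira endpoint, and the mock response simply mirrors the JSON shape shown earlier.

```javascript
// Hypothetical stand-in for the HTTP call to the Jira search endpoint.
// It returns a mock response shaped like the JSON shown above.
function fetchPage(startAt, maxResults) {
  var TOTAL = 2823; // pretend the endpoint has 2823 matching results
  var results = [];
  for (var i = startAt; i < Math.min(startAt + maxResults, TOTAL); i++) {
    results.push({ id: i });
  }
  return { startAt: startAt, maxResults: maxResults, total: TOTAL, results: results };
}

var pageSize = 100;

// Step 1: probe call with maxResults=1 just to read the total.
var total = fetchPage(0, 1).total;

// Step 2: number of calls needed at 100 results per call.
var totalCalls = Math.ceil(total / pageSize);

// Step 3: loop through the pages until all calls have been made.
var fetched = 0;
for (var page = 0; page < totalCalls; page++) {
  fetched += fetchPage(page * pageSize, pageSize).results.length;
}
// After the loop, fetched equals total
```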
Now, let’s put it all together.
Step 1. Create Connections for the example
- HTTP Connection for the Jira API.
- Database Connection to the PostgreSQL database.
The URL for the HTTP Connection, which returns the Jira tickets for the project TEST, will look like the one below (we are using the Jira Cloud version of the API, so substitute company in the URL with your actual company name). Note that the Jira API uses preemptive basic authentication.
https://company.atlassian.net/rest/api/3/search?jql=project%20%3D%20TEST
&startAt={startAt}&maxResults=100
Notice that the value of the startAt parameter in the URL includes the token {startAt}. The idea is that, in the loop, we will replace the token {startAt} with the actual values: 0, 100, 200, 300, etc.
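As an illustration of what the token substitution produces, here is a small sketch (not Etlworks internals; the real substitution is performed by the dynamic URLs feature):

```javascript
// The Connection URL with the {startAt} token, as configured above.
var template = "https://company.atlassian.net/rest/api/3/search?jql=project%20%3D%20TEST" +
  "&startAt={startAt}&maxResults=100";

// Each loop iteration replaces the token with the next offset.
var urls = [];
for (var startAt = 0; startAt <= 300; startAt += 100) {
  urls.push(template.replace("{startAt}", String(startAt)));
}
// urls now holds the concrete URLs for startAt=0, 100, 200, 300
```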
For this example, we will be using a database loop, so we will need one more Connection to the temporary database, which will be used to drive the loop.
Step 2. Create the Flows
2.1 Create a SQL Flow that creates a temporary loop table using the SQL below:
create table pages (startAt integer)
2.2 Create a JavaScript Flow to populate the loop table:
var javaImports = new JavaImporter(com.toolsverse.etl.core.engine,
    com.toolsverse.util, com.toolsverse.config, com.toolsverse.etl.sql.util,
    java.util);

with (javaImports) {
    var pageSize = 100;

    // Read the total number of records from the response of the probe call
    var params = Extractor.lookup(etlConfig, scenario, "First", "params",
        "select total");
    var totalNumberOfRecords = params.getFieldValue(params.getRecord(0), "total");

    // Calculate how many pages (calls) are needed
    var numberOfPages = Math.floor(totalNumberOfRecords / pageSize);
    var remainder = totalNumberOfRecords % pageSize;
    numberOfPages = totalNumberOfRecords == 0 ? 0 :
        (numberOfPages == 0 ? 1 :
            (remainder > 0 ? numberOfPages + 1 : numberOfPages));

    // Insert one row per page into the loop table
    var connection = etlConfig.getConnectionFactory().
        getConnection("Temporary database");
    var sql = "insert into pages (startAt) values (?)";

    var startAt = 0;
    for (var page = 0; page < numberOfPages; page++) {
        SqlUtils.executeSql(connection, sql, startAt);
        startAt = startAt + pageSize;
    }
}
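The page-count arithmetic in the script above can be isolated into a small function to see what it produces (a sketch for illustration; the function name is made up):

```javascript
// Same calculation as in the Flow script: 0 records -> 0 pages,
// a partial last page -> one extra call.
function numberOfPages(totalRecords, pageSize) {
  var pages = Math.floor(totalRecords / pageSize);
  var remainder = totalRecords % pageSize;
  return totalRecords === 0 ? 0 :
    (pages === 0 ? 1 :
      (remainder > 0 ? pages + 1 : pages));
}
// For total=2823 and pageSize=100 this yields 29 pages, so the loop
// table ends up with startAt values 0, 100, ..., 2800.
```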
2.3 Create a Flow Extract data from the Web Service and load into database
where the source is a web service (Jira endpoint above), and the destination is a PostgreSQL database.
2.4 Create a nested Flow that combines Flows 1, 2, and 3.
2.5 Modify step 3 by configuring a database loop where the Connection is a Temporary database, and the driving SQL is:
select startAt as "startAt" from pages
Download the Jira example from etlworks.com. You can import the example into your Etlworks account and try it out. Don’t forget to edit the PostgreSQL and Jira Connections before executing the Flow.
Download a fully functional example for another paginated API with a page number query parameter from etlworks.com. You can import the example into your Etlworks account and try it out without changing anything. This Flow pulls the data from the paginated API and loads it into the shared Google Sheet.
API with the next marker
For this example, we will be using the UKRI Gateway API endpoint with the following URL:
https://www.ebi.ac.uk/europepmc/webservices/rest/search
?query=italy&cursorMark={NEXT_MARKER}
If the value of the cursorMark query parameter is * (the star character), the API returns the first page, which contains data and, possibly, the next value for cursorMark:
<response>
<nextCursorMark>535543435GUI999==</nextCursorMark>
<data>
...
</data>
</response>
So, the algorithm to iterate through the pages looks like this:
- Make the first call with cursorMark=*.
- Get the value for cursorMark from the response by extracting the value of the nextCursorMark field.
- Create a loop to iterate through the pages. The loop ends when the response doesn’t have the nextCursorMark.
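The next-marker loop can be sketched in plain JavaScript. This is an illustration only: the hypothetical fetchPage() stands in for the real HTTP call, the mock marker values are made up, and the real Flow keeps the marker in system properties rather than a local variable.

```javascript
// Hypothetical stand-in for the HTTP call; returns mock pages where
// the last response carries no nextCursorMark.
function fetchPage(cursorMark) {
  var pages = {
    "*":   { data: [1, 2], nextCursorMark: "AAA" },
    "AAA": { data: [3, 4], nextCursorMark: "BBB" },
    "BBB": { data: [5] } // no next marker: the loop ends here
  };
  return pages[cursorMark];
}

var marker = "*"; // the first call always uses the star character
var records = [];
while (marker) {
  var response = fetchPage(marker);
  records = records.concat(response.data);
  marker = response.nextCursorMark; // undefined on the last page
}
// records now holds the data from all three pages
```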
Now, let’s put it all together.
Step 1. Create Connections for the example
- HTTP Connection for the UKRI Gateway API endpoint above.
- Connection to the shared Google Sheet.
Notice that the value of the cursorMark query parameter is a token, {NEXT_MARKER}. The idea is to replace the token with the next marker, make the API call, and repeat until there is no next marker anymore.
We will be loading data into the shared Google Sheet by calling the API endpoint multiple times, so we will need a Connection to the temporary staging database to dump the data from the API before loading the whole thing into the Google Sheet.
Step 2. Create the Flows
2.1 Create a JavaScript Flow to set the original value of the {NEXT_MARKER} token to * (the star character):
com.toolsverse.config.SystemConfig.instance().getProperties().put("NEXT_MARKER", "*");
2.2 Create a Flow Extract data from the Web Service and load into database
where the source is a web service and the destination is a temporary database.
2.3 Click MAPPING > Additional Transformations and add the following JavaScript as the value for the After Extract transformation:
// Read nextCursorMark from the first record of the extracted data set
var nextCursorMark = dataSet.getFieldValue(dataSet.getRecord(0),
    "nextCursorMark");

if (com.toolsverse.util.Utils.isNothing(nextCursorMark)) {
    nextCursorMark = "";
}

// Store the marker so the next iteration of the loop can use it
com.toolsverse.config.SystemConfig.instance().getProperties().put("NEXT_MARKER",
    nextCursorMark);
2.4 Create Flow Database to well-known API
where the source is a temporary database and the destination is a shared Google Sheet.
2.5 Create a nested Flow that combines Flows 1, 2, 3, and 4.
Modify step 2 by configuring a JavaScript loop, where the JavaScript for the loop is:
// Read the marker stored by the After Extract transformation
var nextCursorMark = com.toolsverse.config.SystemConfig.instance().getProperties().
    get("NEXT_MARKER");

// The loop continues while value is not null
value = !com.toolsverse.util.Utils.isNothing(nextCursorMark) ? nextCursorMark : null;
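To see how the loop condition behaves, here is a plain-JavaScript stand-in (a sketch; isNothing here only approximates the real com.toolsverse.util.Utils.isNothing, and loopValue is a made-up wrapper around the loop script's assignment):

```javascript
// Approximation of Utils.isNothing: true for null, undefined, or blank strings.
function isNothing(s) {
  return s === null || s === undefined || String(s).trim() === "";
}

// Mirrors the loop script: a non-empty marker keeps the loop running,
// anything else returns null, which ends the loop.
function loopValue(nextCursorMark) {
  return !isNothing(nextCursorMark) ? nextCursorMark : null;
}
// loopValue("535543435GUI999==") keeps iterating; loopValue("") stops
```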
Download a fully functional example for the paginated API with a next marker from etlworks.com. You can import the example into your Etlworks account and try it out without changing anything. This Flow pulls the data from the paginated API and loads it into the shared Google Sheet.