Problem
A real estate company needs to create a Flow that loads information about property listings into the database and copies all the images that are associated with the particular property into the bucket in Amazon S3.
Requirements
- The Flow must be able to parse the nested XML document returned by the web service, flatten it, and load it into the PostgreSQL database.
- The Flow must be able to copy all the images associated with the particular property into the bucket in Amazon S3.
- The Flow should be able to update the existing properties, in addition to adding new properties.
- The Flow will be running multiple times a day.
Input Format
The web service returns an XML in the following Format:
<listings xmlns="listings-schema">
<property>
<mls_id>53928854</mls_id>
<street_address>1234 Main St</street_address>
<city>Berlin</city>
<other_fields/>
<pictures>
<picture>
<url>http://www.mycompany.com/mls/gen/images/11111.jpg</url>
<picture_caption>The open floorplan shows</picture_caption>
<display_order>1</display_order>
</picture>
<picture>
<url>http://www.mycompany.com/mls/gen/images/22222.jpg</url>
<picture_caption>Backyard view</picture_caption>
<display_order>2</display_order>
</picture>
</pictures>
</property>
...
<property>
<mls_id>123456</mls_id>
<street_address>14567 Main St</street_address>
<city>Pittsburgh</city>
<other_fields/>
<pictures>
<picture>
<url>http://www.mycompany.com/mls/gen/images/11111.jpg</url>
<picture_caption>The open floorplan</picture_caption>
<display_order>1</display_order>
</picture>
<picture>
<url>http://www.mycompany.com/mls/gen/images/22222.jpg</url>
<picture_caption>Backyard view</picture_caption>
<display_order>2</display_order>
</picture>
</pictures>
</property>
</listings>
Solution
The solution includes parsing the XML returned by the web service twice:
- First, to load properties into the database.
- Second, to create a list of images to upload into S3.
Once the list of images is created, the second Flow can copy them to the S3 bucket using a technique called database loop.
Create Flow which can load listings in the database and create a table with images
Step 1. Create an HTTP Connection that will be used as a source.
Step 2. Create an XML Format with all the default settings, except property Parse XML Attributes in Root Node
which is disabled.
Step 3. Create a database Connection to the staging PostgreSQL database. This Connection will be used as a destination.
Step 4. Create a new Flow by selecting the Web Service to database
in the gallery.
Step 5. Add a new source-to-destination transformation with the following parameters:
- Connection (
FROM
): an HTTP Connection created in Step 1 - Format (
FROM
): an XML Format created in Step 2 From
:listings
To
:listings
(the name of the table in the staging database)- Connection (
TO
): a database Connection created in Step 3
Step 6. Click MAPPING
and type the following SQL in the Source query
field:
select * from listings.property
The SQL flattens the nested XML document returned by the web service. Read about using SQL with nested datasets.
Step 7. Click Create Mapping
and exclude the nested fields, such as pictures
and open houses
. Rename some of the fields if needed.
Step 8. Configure the transformation to MERGE
extracted properties with the existing properties in the listing table. Click the Parameters
tab and set the following properties:
Step 9. Add the second transformation with the following parameters:
- Connection (
FROM
): an HTTP Connection created in step 1. - Format (
FROM
): an XML Format created in Step 2. From
:listings
.To
:listing_image
(the name of the table with images in the staging database).- Connection (
TO
): a database Connection created in step 3.
Step 10. Click MAPPING
and type the following SQL in the Source query
field:
select*
fromlistings.property
Step 11. Click the Complex Transformations
tab and configure the transformation Extract Dimensions:
- Extract Dimensions:
pictures,picture
- Columns to Include Together with Dimensions to Extract:
mls_id
Step 12. Click Test Transformation
to preview the results. It should look like the following:
The elements under the nested
<pictures>
tag were transformed into the flat table and unique mls_id
was added to the list of the fields extracted from the pictures.
Step 13. Configure the transformation to MERGE
extracted properties with the existing properties in the listing table. Click the Parameters
tab and set the following properties:
Action
:MERGE
orIfExist
.Lookup Fields
:mls_id
, URL (a unique identifier of the image which belongs to the property).
Alternatively, you can configure the transformation to truncate the listing_image
table before loading new images:
- Under the
Properties/Before SQL
enter the following SQL: -
truncate table listing_image;
Step 14. Save the transformation.
Create Flows to copy images into an Amazon S3 bucket
Step 1. Create an HTTP Connection to load images from. We will be setting the URL dynamically, so enter the {image_url}
in the URL
field.
Step 2. Create an Amazon S3 Connection, which will be used as a destination.
Step 3. Create a new Flow by selecting the Copy Files
Flow type in the gallery.
Step 4. Add new transformation and specify the following parameters:
- Connection (
FROM
): picture. From
: the HTTP Connection created in step 1.To
:{file_name}
.- Connection (
TO
): the Amazon S3 Connection created in step 2.
Step 5. Since it is possible that some of the images to download don't exist anymore, or the URL is wrong, it makes sense to ignore the exceptions when running this Flow.
Step 6. Save the Flow.
Step 7. Create a new nested Flow. Add the Flow created in the previous step to the nested Flow.
Step 8. Configure a database loop that will iterate through the images in the property_image
table and copy each image to the Amazon S3:
Loop Type
: SQL.Connection
: the database Connection.Loop Script
: the following SQL.-
theSELECT RIGHT(url, POSITION('/' in REVERSE(url)) -1) file_name, url image_url from listing_image
file_name
andimage_url
fields. The same{tokenized}
names are used in theCopy Files
Flow, created in a previous step. Read about parametrization. Loop Threads
: 10. Read about executing Flows in a loop in parallel threads.
Create Flow which combines all Flows together
Step 1. Create a new nested Flow. Add the Flows (steps) in the following order:
- The Flow which extracts data from a web service and loads in the staging database.
- The Flow which copies images to Amazon S3 in a loop.
Step 2. Save and schedule the Flow to run every four hours.
Comments
0 comments
Please sign in to leave a comment.