Problem
A real estate company needs to create a flow that loads information about property listings into the database and copies all the images that are associated with the particular property into the bucket in Amazon S3.
Requirements
- The flow must be able to parse the nested XML document returned by the web service, "flatten" it and load into the PostgreSQL database.
- The flow must be able to copy all the images, associated with the particular property into the bucket in Amazon S3.
- The flow should be able to update the existing properties, in addition to adding new properties.
- The flow will be running multiple times a day.
Input Format
The web service returns an XML in the following format:
<listings xmlns="listings-schema">
<property>
<mls_id>53928854</mls_id>
<street_address>1234 Main St</street_address>
<city>Berlin</city>
<other_fields/>
<pictures>
<picture>
<url>http://www.mycompany.com/mls/gen/images/11111.jpg</url>
<picture_caption>The open floorplan shows</picture_caption>
<display_order>1</display_order>
</picture>
<picture>
<url>http://www.mycompany.com/mls/gen/images/22222.jpg</url>
<picture_caption>Backyard view</picture_caption>
<display_order>2</display_order>
</picture>
</pictures>
</property>
...
<property>
<mls_id>123456</mls_id>
<street_address>14567 Main St</street_address>
<city>Pittsburgh</city>
<other_fields/>
<pictures>
<picture>
<url>http://www.mycompany.com/mls/gen/images/11111.jpg</url>
<picture_caption>The open floorplan</picture_caption>
<display_order>1</display_order>
</picture>
<picture>
<url>http://www.mycompany.com/mls/gen/images/22222.jpg</url>
<picture_caption>Backyard view</picture_caption>
<display_order>2</display_order>
</picture>
</pictures>
</property>
</listings>
Solution
The solution includes parsing the XML returned by the web service twice:
- First to load properties into the database.
- Second to create a list of images to upload into S3.
Once the list of images is created, the second flow can copy them to the S3 bucket using a technique called database loop.
Create flow which can load listings in the database and creates a table with images
Step 1. Create an HTTP connection which will be used as a source.
Step 2. Create an XML format with all the default settings, except property "Parse XML Attributes in Root Node" which is disabled.
Step 3. Create a database connection to the staging PostgreSQL database. This connection will be used as a destination.
Step 4. Create a new flow by selecting the "Web Service to database" in the gallery.
Step 5. Add a new source-to-destination transformation with the following parameters:
- Connection (from) - an HTTP connection created in Step 1
- Format (from) - an XML format created in Step 2
- From - listings
- To - listing (the name of the table in the staging database)
- Connection (to) - a database connection created in Step 3
Step 6. Click the MAPPING button and type the following SQL in the Source query field:
select * from listings.property
The SQL "flattens" the nested XML document returned by the web service. Read about using SQL with nested datasets.
Step 7. Click the Create Mapping button and exclude the nested fields, such as pictures
and open houses
. Rename some of the fields if needed.
Step 8. Configure the transformation to MERGE extracted properties with the existing properties in the listing table. Click the "Parameters" tab and set the following properties:
Step 9. Add the second transformation with the following parameters:
- Connection (from) - an HTTP connection created in step 1
- Format (from) - an XML format created in Step 2
- From - listings
- To - listing_image (the name of the table with images in the staging database)
- Connection (to) - a database connection created in step 3
Step 10. Click the MAPPING button and type the following SQL in the Source query field:
select * from listings.property
Step 11. Click the "Complex Transformations" tab and configure the transformation Extract Dimensions:
- Extract Dimensions - pictures,picture
- Columns to Include Together with Dimensions to Extract - mls_id
Step 12. Click the Test Transfromation button to preview the results. It should look like the following:
the elements under the nested <pictures> tag were transformed into the flat table and unique mls_id was added to the list of the fields extracted from the pictures.
Step 14. Configure the transformation to MERGE extracted properties with the existing properties in the listing table. Click the "Parameters" tab and set the following properties:
- Action - MERGE or IfExist.
- Lookup Fields - mls_id,url (a unique identifier of the image which belongs to the property)
Alternatively you can configure the transformation to truncate the listing_image table before loading new images:
- Under the Properties/Before SQL enter the following SQL:
-
truncate table listing_image;
Step 13. Save the transformation.
Create flows to copy images into an Amazon S3 bucket.
Step 1. Create an HTTP connection to load images from. We will be setting the URL dynamically so enter the {image_url}
in the URL field.
Step 2. Create an Amazon S3 connection which will be used as a destination.
Step 3. Create a new flow by selecting the "Copy Files" flow type in the gallery.
Step 4. Add new "transformation" and specify the following parameters:
- Connection (from) - picture
- From - the HTTP connection created in step 1
- To - {file_name}
- Connection (to) - the Amazon S3 connection created in step 2
Step 5. Since it is possible that some of the images to download don't exist anymore, or the URL is wrong, it makes sense to ignore the exceptions when running this flow.
Step 6. Save the flow.
Step 7. Create a new nested flow. Add the flow created in a previous step to the nested flow.
Step 8. Configure a database loop which will iterate through the images in the property_image table and copy each image to the Amazon S3:
- Loop Type - SQL
- Connection - the database connection
- Loop Script - the following SQL
-
the file_name and image_url fields. The same {tokenized} names are used in the Copy Files flow, created in a previous step. Read about parametrization.SELECT RIGHT(url, POSITION('/' in REVERSE(url)) -1) file_name, url image_url from listing_image
- Loop Threads - 10. Read about executing flows in a loop in parallel threads.
Create flow which combines all flows together.
Step 1. Create a new nested flow. Add the flows (steps) in the following order:
- The flow which extracts data from a web service and loads in the staging database.
- The flow which copies images to Amazon S3 in a loop.
Step 2. Save and schedule the flow to run every four hours.
Comments
0 comments
Please sign in to leave a comment.