Overview
XML is common in enterprise data exchange. Supports deeply nested elements and attributes.
When to use this Format
Etlworks can automatically parse and create practically any XML document. You can use additional parameters to hint the XML parser on how to deal with the XML nodes and attributes when reading and creating XML documents.
Use JSON Format when configuring a source-to-destination transformation that reads or writes XML documents.
Read how to work with nested JSON and XML files in Etlworks.
Create Format
To create a new XML Format, go to Connections, select the Formats tab, click Add Format, type in xml in the Search field, and select XML. Note that it is not the same as an XML Data Set.
Parameters
- XML Document: when this option is selected (the default), the parser automatically adds the root tag xml to the document. Always select it if you want your XML documents to be created with an encoding different from UTF-8.
-
<?xml version="1.0" encoding="UTF-8"?>
- Start Element Name: the value of this parameter is used as an XML root element.
When creating XML document, the Start Element Name is used as a root node. For example, when the Start Element Name is set to Patient:
-
<Patient> the rest of the xml document </Patient>
When reading XML documents, the Start Element Name is used as a hint for the XML parser to automatically create a nested data set with the name equal to the value of this field.
- XML Version: the XML standard version. The default is 1.0.
- Namespaces: all applicable XML namespaces. Namespaces can only be set if Start Element Name is not empty. For example, when the Start Element Name is set to Patient and Namespaces is set to xmlns="http://www.meditlive.com/datacontract xmlns:i="http://www.w3.org/2001/XMLSchema-instance":
-
<Patient xmlns="http://www.meditlive.com/datacontract" xmlns:i="http://www.w3.org/2001/XMLSchema-instance"> the rest of the xml document </Patient>
- Row Tag: the tag used for repeating rows.
When reading XML like below:
-
<Patients> <Patient> <first_name>Joe</first_name> <last_name>Doe</last_name> <dob>01/01/2001</dob> </Patient> .... <Patients>
By default, the system creates nested data sets for each patient row. If you wish to flatten the data set you can set the Row Tag to patient. As a result, the flat data set with the following rows and columns will be created:
-
first_name,last_name,dob joe,doe,01/01/2001 .... jane,doe,01/02/2002
When creating an XML document, the value of the Row Tag field is used literally to create new repeating XML nodes. For example, with the Row Tag set to ROW:
-
<Patients> <ROW> </ROW> <ROW> </ROW> <ROW> </ROW> <Patients>
- Null Attribute: if set, a tag with a null value will be created using this attribute.
For example (with no Null Attribute):
-
<MiddleName />
Another example (with the Null Attribute set to i:nil="true"):
-
<MiddleName i:nil="true"/>
- Value Attribute: if set, the values are set using specific attributes.
For example (with no Value Attribute):
-
<FirstName>Joe</FirstName>
Another example (with the Value Attribute set to value):
-
<FirstName value="Joe"/>
- Template: a template in the XML Format. If this field is not empty, Etlworks will use it to populate column names and data types. Template is an optional field. For example: <root><FirstName/><LastName/><Dob/></root>.
- Column names compatible with SQL: convert column names to SQL-compatible column names by removing all characters, except alphanumeric and spaces.
- Parse XML Attributes: enabling this option will force the system to parse XML attributes: <tag attr="value">.
- Parse XML Attributes in Root Node: disable this flag if you don't want the system to parse attributes in the root XML node. This feature only works when Parse XML Attributes is enabled. An example of the attributes in the root node: <data xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1">.
- Parse CDATA section: disable this flag if you know for sure that there are no CDATA sections in the XML to parse. This will improve parsing performance.
- Parse Comments: if this option is enabled, the system will add a field with a suffix -comments for each node that has comments.
-
Ignore XML Attributes if Node has Value and Attributes:
-
Here is a node with the attribute and value:
<node attr="attr_value">value</node>When this option is enabled the parser will read the value and will ignore the attribute.
-
Ignore XML Attributes if Node has Value and Attributes:
-
Here is a node with the attribute and value:
<node attr="attr_value">value</node>When this option is enabled the parser will read the attribute and will ignore the value.
- Transformation type: the default is query, which is equal to applying an XSL Style Sheet, but you can also select Preprocessor. Read more about using a preprocessor to modify the content of the source document.
- XSL Style Sheet or Preprocessor: an XSL stylesheet can transform an XML document into the desired Format. An XSL stylesheet is applied when reading and writing XML documents. Read more about using XSL stylesheets to transform XML.
- Encoding: XML document encoding. The default is UTF-8.
- All fields are strings: if this option is enabled (it is disabled by default) the system will create all fields with a string data type.
- Date and Time Format: a Format for timestamps (date+time).
- Date Format: a Format for date (date only, no time).
- Time Format: a Format for time (time only, no date).
- Parse Dates: if this option is enabled, and date or time value is not recognized as one of the Formats defined above, Etlworks will try to parse it using one of the well-known date+time Formats.
- Trim Strings: if this option is enabled, Etlworks will trim leading and trailing white spaces from the value.
- Treat 'null' as null: if this option is enabled, Etlworks will treat string values equal to null as actual nulls (no value).
-
Do not create parent node for nested arrays
By default Etlworks XML connector creates a parent node for nested XML arrays. The name of the node is the same as the name of the first node in the array.
<data>
<prescriber>Mr Smith</prescriber>
<patient>
<patient>John Doe</patient>
<patient>Jane Doe</patient>
</patient>
</data>If this option is enabled (it is disabled by default) the connector will not create the parent node.
<data>
<prescriber>Mr Smith</prescriber>
<patient>John Doe</patient>
<patient>Jane Doe</patient>
</data>
Use XSLT when reading and writing XML documents
Use XSLT when reading XML documents
You have received a source XML from a web service and need to load it into a database. Your source XML is a nested data object, like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<GoodsReceivedIntoBond>
<GoodsReceivedIntoBondHeader WarehouseNo="CTN 12" CustomerCode="Cust1"
DocumentNo="00123" DocumentDate="2017-02-08" NumberOfLine="3" DocumentXRef="12345">
<GoodsReceivedIntoBondLines>
<StockCode>B3B4</StockCode>
<StockQuantity>60.0000</StockQuantity>
</GoodsReceivedIntoBondLines>
<GoodsReceivedIntoBondLines>
<StockCode>B5GT</StockCode>
<StockQuantity>60.0000</StockQuantity>
</GoodsReceivedIntoBondLines>
<GoodsReceivedIntoBondLines>
<StockCode>B121W</StockCode>
<StockQuantity>360.0000</StockQuantity>
</GoodsReceivedIntoBondLines>
</GoodsReceivedIntoBondHeader>
<GoodsReceivedIntoBondHeader WarehouseNo="CTN 13" CustomerCode="Cust2"
DocumentNo="00456" DocumentDate="2017-02-08" NumberOfLine="2" DocumentXRef="6789">
<GoodsReceivedIntoBondLines>
<StockCode>B7C7</StockCode>
<StockQuantity>16.0000</StockQuantity>
</GoodsReceivedIntoBondLines>
<GoodsReceivedIntoBondLines>
<StockCode>B12345</StockCode>
<StockQuantity>16.0000</StockQuantity>
</GoodsReceivedIntoBondLines>
</GoodsReceivedIntoBondHeader>
</GoodsReceivedIntoBond>
Most databases store data in flat tables with rows and columns. So, ideally, you would like to denormalize a nested data object, like the one above, and transform it into a flat table:
Process
In Etlworks, all you need to do to denormalize a nested data object and transform it into a flat table is to use an XML Format and an XSL stylesheet by following the steps below:
Step 1. Create an XSL stylesheet that transforms a nested-source XML, like the one above, into a denormalized flat XML, like the one below:
<?xml version="1.0" encoding="UTF-8"?>
<result>
<ROW>
<WarehouseNo>WN</WarehouseNo>
<CustomerCode>CC</CustomerCode>
<DocumentNo>DN</DocumentNo>
<DocumentDate>DD</DocumentDate>
<NumberOfLine>NOL</NumberOfLine>
<DocumentXRef>XREF</DocumentXRef>
<StockCode>SC</StockCode>
<StockQuantity>SQ</StockQuantity>
</ROW>
<ROW>
.....
</ROW>
</result>
For this solution, we will be using the following XSL stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes"/>
<xsl:template match="/">
<xsl:variable name="var1_initial" select="."/>
<result>
<xsl:for-each select="GoodsReceivedIntoBond/GoodsReceivedIntoBondHeader">
<xsl:variable name="var2_cur" select="."/>
<xsl:for-each select="GoodsReceivedIntoBondLines">
<xsl:variable name="var3_cur" select="."/>
<ROW>
<WarehouseNo>
<xsl:value-of select="$var2_cur/@WarehouseNo"/>
</WarehouseNo>
<CustomerCode>
<xsl:value-of select="$var2_cur/@CustomerCode"/>
</CustomerCode>
<DocumentNo>
<xsl:value-of select="number($var2_cur/@DocumentNo)"/>
</DocumentNo>
<DocumentDate>
<xsl:value-of select="$var2_cur/@DocumentDate"/>
</DocumentDate>
<NumberOfLine>
<xsl:value-of select="number($var2_cur/@NumberOfLine)"/>
</NumberOfLine>
<DocumentXRef>
<xsl:value-of select="number($var2_cur/@DocumentXRef)"/>
</DocumentXRef>
<StockCode>
<xsl:value-of select="StockCode"/>
</StockCode>
<StockQuantity>
<xsl:value-of select="number(StockQuantity)"/>
</StockQuantity>
</ROW>
</xsl:for-each>
</xsl:for-each>
</result>
</xsl:template>
</xsl:stylesheet>
Note the <ROW>...</ROW> tag in the stylesheet.
Step 2. Create an XML Format in Etlworks, which uses the XSL stylesheet above.
Notice the parameters below:
- Row Tag: ROW, in our case.
- XPath Query or XSL Style Sheet or Preprocessor: XSL stylesheet above, in our case.
Step 3. Use the Format created in Step 2 as your source (FROM) Format.
Use XSLT when writing XML documents
You need to create an XML document like the one below.
<Transaction>
<Invoice>AR</Invoice>
<InvoiceNo>654058</InvoiceNo>
<InvoiceType>RI</InvoiceType>
<AccountNo>1601001223</AccountNo>
<InvoiceDate>20160429</InvoiceDate>
<BranchId>64</BranchId>
<Department>15</Department>
<FileRef>F21557</FileRef>
<InvoiceTotal>13469.02</InvoiceTotal>
<InvoiceLine>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX6636-RFC</TransactionDesc>
<Amount>642.84</Amount>
<Vat>90</Vat>
</InvoiceLine>
<InvoiceLine>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX308707-RFC</TransactionDesc>
<Amount>60.97</Amount>
<Vat>8.53</Vat>
</InvoiceLine>
<InvoiceLine>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX307851-RFC</TransactionDesc>
<Amount>233.52</Amount>
<Vat>32.69</Vat>
</InvoiceLine>
</Transaction>
Your source is a flat file, database table, or dataset produced by executing an SQL query, which looks like the one below. To simplify this example, we will not be explaining how to select data from the database or read a flat file.
Here are the columns in the source:
Process
Step 1. Create a source Connection.
Step 2. Create a source Format if needed. This step is optional if we are going to read data from a database.
Step 3. Create a destination Connection.
Step 4. Create an XML Format for the destination. When creating a Format:
- Uncheck XML Document.
- Set Start Element Name to Transaction.
- Set Row Tag to InvoiceLine.
Step 5. Create a Flow where the source (FROM) is a Connection, and (optionally) create a Format using the source created in Steps 1-2, and choices for the destination (TO) to match the Connection and Format created in Steps 3-4.
Step 6. Execute the Flow and check the output. It should look like this:
<Transaction>
<InvoiceLine>
<Invoice>AR</Invoice>
<InvoiceNo>654058</InvoiceNo>
<InvoiceType>RI</InvoiceType>
<AccountNo>1601001223</AccountNo>
<InvoiceDate>20160429</InvoiceDate>
<BranchId>64</BranchId>
<Department>15</Department>
<FileRef>F21557</FileRef>
<InvoiceTotal>13469.02</InvoiceTotal>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX6636-RFC</TransactionDesc>
<Amount>642.84</Amount>
<Vat>90.0</Vat>
</InvoiceLine>
<InvoiceLine>
<Invoice>AR</Invoice>
<InvoiceNo>654058</InvoiceNo>
<InvoiceType>RI</InvoiceType>
<AccountNo>1601001223</AccountNo>
<InvoiceDate>20160429</InvoiceDate>
<BranchId>64</BranchId>
<Department>15</Department>
<FileRef>F21557</FileRef>
<InvoiceTotal>13469.02</InvoiceTotal>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX308707-RFC</TransactionDesc>
<Amount>60.97</Amount>
<Vat>8.53</Vat>
</InvoiceLine>
<InvoiceLine>
<Invoice>AR</Invoice>
<InvoiceNo>654058</InvoiceNo>
<InvoiceType>RI</InvoiceType>
<AccountNo>1601001223</AccountNo>
<InvoiceDate>20160429</InvoiceDate>
<BranchId>64</BranchId>
<Department>15</Department>
<FileRef>F21557</FileRef>
<InvoiceTotal>13469.02</InvoiceTotal>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX307851-RFC</TransactionDesc>
<Amount>233.52</Amount>
<Vat>32.69</Vat>
</InvoiceLine>
</Transaction>
As you can see, we are almost there, but not quite. We still need to move the Invoice... fields outside of the <InvoiceLine> block. We will use an XLT transformation to reformat the XML into the expected format.
Step 7. Modify the XML Format created in Step 4 by adding the following XSL style sheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs">
<xsl:output method="xml" encoding="UTF-8" indent="yes" omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:variable name="var1_initial" select="."/>
<Transaction>
<Invoice><xsl:value-of select="Transaction/InvoiceLine/Invoice"/></Invoice>
<InvoiceNo><xsl:value-of select="Transaction/InvoiceLine/InvoiceNo"/></InvoiceNo>
<InvoiceType><xsl:value-of select="Transaction/InvoiceLine/InvoiceType"/></InvoiceType>
<AccountNo><xsl:value-of select="Transaction/InvoiceLine/AccountNo"/></AccountNo>
<InvoiceDate><xsl:value-of select="Transaction/InvoiceLine/InvoiceDate"/></InvoiceDate>
<BranchId><xsl:value-of select="Transaction/InvoiceLine/BranchId"/></BranchId>
<Department><xsl:value-of select="Transaction/InvoiceLine/Department"/></Department>
<FileRef><xsl:value-of select="Transaction/InvoiceLine/FileRef"/></FileRef>
<InvoiceTotal><xsl:value-of select="Transaction/InvoiceLine/InvoiceTotal"/></InvoiceTotal>
<xsl:for-each select="Transaction/InvoiceLine">
<xsl:variable name="var2_cur" select="."/>
<InvoiceLine>
<BillingCode>
<xsl:value-of select="BillingCode"/>
</BillingCode>
<TransactionDesc>
<xsl:value-of select="TransactionDesc"/>
</TransactionDesc>
<Amount>
<xsl:value-of select="number(Amount)"/>
</Amount>
<Vat>
<xsl:value-of select="number(Vat)"/>
</Vat>
</InvoiceLine>
</xsl:for-each>
</Transaction>
</xsl:template>
</xsl:stylesheet>
Step 8. Execute the Flow again. This time, the output will look like the following, which is exactly what we need:
<Transaction>
<Invoice>AR</Invoice>
<InvoiceNo>654058</InvoiceNo>
<InvoiceType>RI</InvoiceType>
<AccountNo>1601001223</AccountNo>
<InvoiceDate>20160429</InvoiceDate>
<BranchId>64</BranchId>
<Department>15</Department>
<FileRef>F21557</FileRef>
<InvoiceTotal>13469.02</InvoiceTotal>
<InvoiceLine>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX6636-RFC</TransactionDesc>
<Amount>642.84</Amount>
<Vat>90</Vat>
</InvoiceLine>
<InvoiceLine>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX308707-RFC</TransactionDesc>
<Amount>60.97</Amount>
<Vat>8.53</Vat>
</InvoiceLine>
<InvoiceLine>
<BillingCode>G37T</BillingCode>
<TransactionDesc>NEX307851-RFC</TransactionDesc>
<Amount>233.52</Amount>
<Vat>32.69</Vat>
</InvoiceLine>
</Transaction>
Configure the XML parser
Configure XML parser
You can configure XML Format to parse (or not to parse) the specific parts of the XML.
Parse XML attributes
By default, if the source XML contains attributes, such as <node attribute="value">, the attributes will not be automatically parsed. Enable parsing attributes by creating a new XML Format and selecting the flag Parse XML Attributes.
Not parsing XML attributes in the root node
If the flag Parse XML Attributes for the XML Format is enabled, Etlworks will parse the attributes in all nodes, including the root node. Frequently, the attributes in the root node are in fact XML schema(s), such as the following:
<data xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
To disable parsing the root attributes, create a new XML Format and disable the flag Parse XML Attributes in Root Node.
Note: This flag has no effect on parsing if Parse XML Attributes is disabled.
Not parsing CDATA section
By default, Etlworks will parse the CDATA sections in XML:
<name>
<![CDATA[this is a test]]>
</name>
Disable this flag if you are absolutely sure that there are no CDATA sections in the XML to parse. This will enable the XML to be parsed slightly faster.
Parse comments
By default, Etlworks ignores comments when parsing the source XML documents.
<!-- this is comments -->
<name>
value
</name>
If you enable the flag Parse Comments the system will add a field with a suffix -comments for each node with comments.
Using the example above, the two fields will be created: name=value and name-comments=this is comments.
Ignore XML Attributes if Node has Value and Attributes
Here is a node with the attribute and value:
<node attr="attr_value">value</node>
When this parameter is enabled, the parser will read the value and will ignore the attribute.
Ignore Node Value if Node has Value and Attributes
Here is a node with the attribute and value:
<node attr="attr_value">value</node>
When this parameter is enabled, the parser will read the attribute and will ignore the value.
Do not create parent node for nested arrays
By default, Etlworks XML connector creates a parent node for nested XML arrays. The name of the node is the same as the name of the first node in the array.
<data>
<prescriber>Mr Smith</prescriber>
<patient>
<patient>John Doe</patient>
<patient>Jane Doe</patient>
</patient>
</data>
If this option is enabled (it is disabled by default), the connector will not create the parent node.
<data>
<prescriber>Mr Smith</prescriber>
<patient>John Doe</patient>
<patient>Jane Doe</patient>
</data>
Read XML files (version 2.x parser)
Problem
When reading XML files version 2.x
<?xml version="2.0.2"..
the ETL flow generates an exception:
SAXParseException; lineNumber: 1; columnNumber: 22; XML version "2.0.2" is not supported
Solution
Assuming that the XML data you are importing starts with an XML prolog like the following:
<?xml version="2.0.2" encoding="UTF-8"?>
In order to process it, the version declared in the source can be replaced with 1.0 by using the JavaScript preprocessor configured for source XML format:
message.replace('<?xml version="2.0.2"', '<?xml version="1.0"');