Split file – Etlworks Support

Overview

If a source file is large, especially if it is an XML file, it makes sense to split it into multiple smaller files and then process them one by one in a loop.

Note: While splitting files, Etlworks uses minimal RAM and CPU, which allows it to work with very large files without blocking other tasks.

Why would you want to split large files?

The source file is so large that it cannot be parsed because of memory limitations.
There is a source-to-destination transformation (for example, read an XML file, create JSON, and send it to an HTTP endpoint), and the payload created by the transformation is too large for the destination.
The source file contains information you don't need, apart from a few repeating segments.

Create Flow

Step 1. Start creating the Flow that splits files by opening the Flows window, clicking +, and typing in split. Select a Flow based on the file type (XML, JSON, CSV, etc.).

Step 2. Continue by selecting a source Connection (FROM), and a file name (or a wildcard file name).

Step 3. Continue by selecting a destination Connection (TO).

Step 4. Continue by specifying parameters. Click MAPPING and select the Parameters tab.

Note: The parameters are different for each file type.

Split XML files

Unlike other file types, XML files require a lot of memory to parse and process. Splitting the original file into multiple smaller files that easily fit into memory is typically a good idea, and for particularly large files it is a requirement.

Important: The XML files must have repeating segments in order to be splittable.

This is an example of XML with repeating segments, in this case, called <Entries>:

<?xml version="1.0" ?>
<Transaction>
     <feed>1</feed>
     <source>http</source> 
     <!-- repeating segments -->  
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>5064594</BOENumber>
     </Entries>
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>11111</BOENumber>
     </Entries>
</Transaction>

An example of XML without repeating segments:

<?xml version="1.0" ?>
<Transaction>
     <feed>1</feed>
     <source>http</source> 
     <File>N71738</File>
     <Data>
          <File>N71738</File>
     </Data>
     <HBLNo>3470312436</HBLNo>
     <BOENumber>5064594</BOENumber>
</Transaction>

To split XML files, select the Split XML files Flow type. Follow Steps 1 to 4 above.

The following parameters are available:

Maximum number of repeating segments: the maximum number of repeating segments to save in one file.
Paths for the repeating segments in XML: key-value pairs, where the key is the XML path of a repeating segment (for example, Transaction/Entries) and the value is a suffix to add to the output file name (for example, transactions). If the value is empty, a default suffix is used, calculated by replacing / in the path with _ (Transaction/Entries becomes Transaction_Entries). A single XML file can contain multiple different repeating segments, each identified by its own path.
Include common segment: if this property is enabled, the system will try to detect a common, non-repeating segment and include it in the output file. Enabling this option can significantly increase processing time. In the example below, the common segment includes the tags <feed> and <source>:

<?xml version="1.0" ?>
<Transaction>
     <!-- common segment --> 
     <feed>1</feed>
     <source>http</source> 
     <!-- repeating segments -->  
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>5064594</BOENumber>
     </Entries>
     <Entries>
          <File>N71738</File>
          <HBLNo>3470312436</HBLNo>
          <BOENumber>11111</BOENumber>
     </Entries>
</Transaction>
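
The parameters above can be illustrated with a minimal Python sketch. This is not how Etlworks implements the split. It is an assumption-laden illustration that streams the file with the standard library's xml.etree.ElementTree.iterparse so that only the current batch of segments is held in memory. The function names split_xml and default_suffix are hypothetical, the sketch only handles two-level paths such as Transaction/Entries, and it does not copy the common segment into the output.

```python
import xml.etree.ElementTree as ET
from io import BytesIO


def default_suffix(path: str) -> str:
    # Default file-name suffix as described above: replace "/" with "_".
    return path.replace("/", "_")


def split_xml(source, path: str, max_segments: int):
    """Yield XML strings with at most max_segments repeating segments each.

    Hypothetical helper: 'path' must look like "Root/Segment", and the
    common (non-repeating) segment is not included in the output.
    """
    root_tag, segment_tag = path.split("/")
    root = None
    chunk = None
    count = 0
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if root is None:
                root = elem
                if root.tag != root_tag:
                    raise ValueError(f"unexpected root element: {root.tag}")
            continue
        depth -= 1
        # depth == 1 here means the element that just ended is a direct
        # child of the root element.
        if depth == 1 and elem.tag == segment_tag:
            if chunk is None:
                chunk = ET.Element(root_tag)
            chunk.append(elem)
            root.remove(elem)  # release the segment from the source tree
            count += 1
            if count == max_segments:
                yield ET.tostring(chunk, encoding="unicode")
                chunk, count = None, 0
    if chunk is not None:
        # Emit the final, partially filled chunk.
        yield ET.tostring(chunk, encoding="unicode")


# Usage: three <Entries> segments, at most two per output file.
sample = (
    b'<?xml version="1.0" ?><Transaction><feed>1</feed><source>http</source>'
    b"<Entries><BOENumber>5064594</BOENumber></Entries>"
    b"<Entries><BOENumber>11111</BOENumber></Entries>"
    b"<Entries><BOENumber>22222</BOENumber></Entries>"
    b"</Transaction>"
)
chunks = list(split_xml(BytesIO(sample), "Transaction/Entries", 2))
suffix = default_suffix("Transaction/Entries")
```

In this sketch, each yielded string would be written to its own output file, with the suffix appended to the file name.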

Other parameters are the same as the parameters for the copy files Flow.

Split JSON files

To split JSON files, select the Split JSON files Flow type. Follow the same steps as when creating a Flow to split XML files.

Split CSV files

Splitting a CSV file is as simple as breaking it into chunks, using the end-of-line character (default), a configurable delimiter, or a regular expression as the separator. The process automatically detects a row with column names (a header) and replicates it in each chunk.
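
The idea of replicating the header in each chunk can be sketched in Python as follows. This is an illustration only, not the Etlworks implementation: the function name split_csv and the max_rows parameter are hypothetical, the first line is simply assumed to be the header (no automatic detection), and lines are split on the end-of-line character only.

```python
def split_csv(lines, max_rows):
    """Yield chunks (lists of lines); every chunk starts with the header.

    Hypothetical helper: assumes the first line is the header row.
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input produces no chunks
    chunk = [header]
    for line in it:
        chunk.append(line)
        if len(chunk) - 1 == max_rows:  # data rows, excluding the header
            yield chunk
            chunk = [header]
    if len(chunk) > 1:
        # Emit the final, partially filled chunk.
        yield chunk


# Usage: three data rows, at most two per chunk.
rows = ["id,name\n", "1,a\n", "2,b\n", "3,c\n"]
csv_chunks = list(split_csv(rows, 2))
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so the source file is read one line at a time.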

To split CSV files, select the Split CSV files Flow type. Follow Steps 1 to 4 above.