Overview
If a source file is large, especially if it is an XML file, it makes sense to split it into multiple smaller files and then process them one by one in a loop.
While splitting the files, Etlworks consumes only a minimum amount of RAM and CPU cycles, which allows it to work with very large files without blocking other tasks.
Why would you want to split large files?
- If the source file is so large that it cannot be parsed due to memory limitations.
- If there is a source-to-destination transformation, for example, read from an XML file, create JSON, and hit an HTTP endpoint, and the payload created as part of the transformation is too large for the destination.
- If the source file contains information you don't really need, except for a few repeating segments.
Create Flow
Step 1. Start creating the Flow that splits files by opening the Flows window, clicking +, and typing in split. Select a Flow based on the file type (XML, JSON, CSV, etc.).
Step 2. Continue by selecting a source Connection (FROM) and a file name (or a wildcard file name).
Step 3. Continue by selecting a destination Connection (TO).
Step 4. Continue by specifying parameters. Click MAPPING and select the Parameters tab.
The parameters are different for each file type.
Split XML files
Unlike other file types, parsing and processing XML files requires a lot of memory. Splitting the original file into multiple smaller files that fit easily into memory is usually a good idea, and for particularly large files it is a requirement.
The XML files must have repeating segments in order to be splittable.
This is an example of XML with repeating segments, in this case called <Entries>:
<?xml version="1.0" ?>
<Transaction>
<feed>1</feed>
<source>http</source>
<!-- repeating segments -->
<Entries>
<File>N71738</File>
<HBLNo>3470312436</HBLNo>
<BOENumber>5064594</BOENumber>
</Entries>
<Entries>
<File>N71738</File>
<HBLNo>3470312436</HBLNo>
<BOENumber>11111</BOENumber>
</Entries>
</Transaction>
An example of XML without repeating segments:
<?xml version="1.0" ?>
<Transaction>
<feed>1</feed>
<source>http</source>
<File>N71738</File>
<Data>
<File>N71738</File>
</Data>
<HBLNo>3470312436</HBLNo>
<BOENumber>5064594</BOENumber>
</Transaction>
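To check whether a sample file has repeating segments before configuring the Flow, a quick script can help. The following is a minimal sketch, not part of Etlworks: the helper name repeating_paths is hypothetical, and it loads the whole document into memory, so it is only suitable for a small sample of the real file. It reports every path whose element occurs more than once under the same parent.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def repeating_paths(xml_text: str) -> list[str]:
    """Return paths (e.g. 'Transaction/Entries') of elements that repeat under one parent."""
    root = ET.fromstring(xml_text)
    found: list[str] = []

    def walk(element, path):
        # Count how many times each tag appears directly under this element.
        counts = Counter(child.tag for child in element)
        for tag, count in counts.items():
            candidate = f"{path}/{tag}"
            if count > 1 and candidate not in found:
                found.append(candidate)
        # Descend into every child to look for nested repeating segments.
        for child in element:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return found
```

For the first sample above, the helper would report Transaction/Entries; for the second sample it would report nothing, because no tag repeats under the same parent.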
To split XML files, select the Split XML files Flow type. Follow Steps 1 to 4 above.
The following parameters are available:
- Maximum number of repeating segments: the maximum number of repeating segments to save in one file.
- Paths for the repeating segments in XML: key-value pairs, where the key is an XML path for the repeating segment, for example, Transaction/Entries, and the value is a suffix to add to the file name, for example, transactions. If the value is empty, a default suffix will be used, which is calculated by replacing / in the path with _. There can be multiple, different repeating segments in a single XML file, each identified by its own path.
- Include common segment: if this property is enabled, the system will try to detect a common, non-repeating segment and include it in the output file. Enabling this option can significantly increase processing time. In the example below, the common segment includes the tags <feed> and <source>:
<?xml version="1.0" ?>
<Transaction>
<!-- common segment -->
<feed>1</feed>
<source>http</source>
<!-- repeating segments -->
<Entries>
<File>N71738</File>
<HBLNo>3470312436</HBLNo>
<BOENumber>5064594</BOENumber>
</Entries>
<Entries>
<File>N71738</File>
<HBLNo>3470312436</HBLNo>
<BOENumber>11111</BOENumber>
</Entries>
</Transaction>
- Other parameters are the same as the parameters for the copy files Flow.
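To make the first two parameters concrete, here is a rough Python sketch of the general technique (streaming the file and writing a new output file every N repeating segments). It is an illustration under stated assumptions, not how Etlworks implements the Flow: the function names, the output naming scheme (source name, suffix, sequence number), and the wrapping of segments in bare ancestor tags are all hypothetical. It ignores XML namespaces and attributes on ancestor elements, and does not handle the common segment.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def default_suffix(segment_path: str) -> str:
    # Default suffix as described above: replace "/" in the path with "_".
    return segment_path.replace("/", "_")


def split_xml(source, segment_path, max_segments, out_dir, suffix=""):
    """Stream `source` and write at most `max_segments` segments per output file."""
    source, out_dir = Path(source), Path(out_dir)
    suffix = suffix or default_suffix(segment_path)
    ancestors = segment_path.split("/")[:-1]
    opening = "".join(f"<{tag}>" for tag in ancestors)
    closing = "".join(f"</{tag}>" for tag in reversed(ancestors))
    stack, batch, written = [], [], []

    def flush():
        if not batch:
            return
        # Hypothetical naming scheme: <source name>_<suffix>_<sequence>.xml
        target = out_dir / f"{source.stem}_{suffix}_{len(written) + 1}.xml"
        target.write_text(opening + "".join(batch) + closing, encoding="utf-8")
        written.append(target)
        batch.clear()

    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(element)
            continue
        path = "/".join(e.tag for e in stack)
        stack.pop()
        if path != segment_path:
            continue
        element.tail = None  # drop whitespace that follows the closing tag
        batch.append(ET.tostring(element, encoding="unicode"))
        if stack:
            stack[-1].remove(element)  # release the segment to keep memory flat
        if len(batch) >= max_segments:
            flush()
    flush()
    return written
```

Under these assumptions, calling split_xml("feed.xml", "Transaction/Entries", 1000, "out") would write files such as feed_Transaction_Entries_1.xml, each holding up to 1000 <Entries> segments.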
Split JSON files
To split JSON files, select the Split JSON files Flow type. Follow the same steps as when creating a Flow to split XML files.
Split CSV files
Splitting a CSV file is as simple as breaking it into chunks using the end-of-line character (default), a configurable delimiter, or a regular expression as a separator. This process automatically detects a row with column names (a header) and replicates it in each chunk.
To split CSV files, select the Split CSV files Flow type. Follow Steps 1 to 4 above.
The following parameters are available:
- Delimiter: a string that is used as a delimiter between lines. The default is end-of-line (EOL).
- Regular Expression for Delimiter: set this parameter to split the CSV file using a regular expression (as opposed to the hard-coded Delimiter). When Regular Expression is configured, the value set in Delimiter is still used to separate lines.
- Maximum number of rows in the file: the maximum number of rows in a single file, excluding the header row of column names (called the 'columns' row).
- Source file has 'columns' row: if this option is enabled, it is assumed that the first row in the source CSV file contains column names (the header).
- Other parameters are the same as the parameters for the copy files Flow.
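The chunking behavior described in this section can be pictured with a short Python sketch. It is a simplified illustration, not the Etlworks implementation: the function name split_csv, the has_header flag, and the output naming scheme are hypothetical. It covers only the default end-of-line delimiter (no custom delimiter or regular expression), and it assumes no quoted field contains an embedded line break.

```python
from pathlib import Path


def split_csv(source, out_dir, max_rows, has_header=True):
    """Write chunks of at most `max_rows` data rows, repeating the header in each chunk."""
    source, out_dir = Path(source), Path(out_dir)
    written = []
    out = None
    rows_in_chunk = 0
    # newline="" keeps the original line endings untouched.
    with source.open("r", encoding="utf-8", newline="") as reader:
        header = reader.readline() if has_header else ""
        for line in reader:
            if out is None:
                # Hypothetical naming scheme: <source name>_<sequence><extension>
                target = out_dir / f"{source.stem}_{len(written) + 1}{source.suffix}"
                out = target.open("w", encoding="utf-8", newline="")
                out.write(header)  # the header row does not count toward max_rows
                written.append(target)
                rows_in_chunk = 0
            out.write(line)
            rows_in_chunk += 1
            if rows_in_chunk >= max_rows:
                out.close()
                out = None
    if out is not None:
        out.close()
    return written
```

Because the file is read one line at a time, memory use stays roughly constant regardless of the size of the source file.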