Overview
It is possible that your source CSV file is not properly formatted, does not conform to the CSV Format standard, or has other issues.
In the Etlworks Integrator, you can modify the properties of the CSV Format so it could be parsed regardless of how broken or non-standard it is.
Process
CSV file with empty lines
Example:
first,last,age
john,doe,23
...empty line...
mad,max,45
When configuring the CSV Format, select the Skip Empty Rows
check box.
The header does not conform to the CSV Format
just a header
john,doe,23
mad,max,45
When configuring the CSV Format, select the Skip First Row
check box and also (typically) Use First Row for Data
.
Some of the lines in the file have a different number of columns
first,last,age
john,doe,23
wrong,number,of,columns
mad,max,45
When configuring the CSV Format, select the Skip not Properly Formatted Rows
check box.
The header has fewer columns than data rows
first,last
john,doe,23,01/01/2000
mad,max,45,01/03/1977
When configuring CSV Format, select the Document has extra data columns
check box and deselect Use First Row for Data
.
Some of the data rows have a fewer or greater number of columns than a header
first,last,age
joe,doe,42
simba,dog
jane,doe,26,Pittsburgh
When configuring the CSV Format, enable one or both options below:
Skip rows with fewer columns than header
: if this option is enabled, the system skips rows that have fewer columns than a header row.Skip rows with more columns than header
: if this option is enabled, the system skips rows that have more columns than a header row.
Reading CSV files with multiline records
Your source CSV file may have records (rows) that span multiple lines.
first,last,description,dob
Joe,Doe,"just a
guy","0102000"
Brad,Pit,"an actor","0101965"
As you can see, for some of the records, the value of the Description
field occupies more than one line.
To handle this, enable Has multiline records
when configuring the CSV Format.
The CSV file has a completely irregular structure and cannot be parsed by a standard parser
In the example below, the first line must be ignored, the second line contains a "header" with one column, while lines from the third downward contain "items" with a different number of columns:
HeaderInfo
XYZ 2|LLL001|MMM0124386|BULK|17/01/11|0000002205|TIGER WHEEL|SHOPPING CENTRE BRITS NORTH WEST|0250|0122527777|2|
77777PYT12P515012MBK|TEST 9.0X17 5/150 ET12|4|5013562|16/12/29|3|554981
77777ROG20P611466MBK|TEST 9.0X17 6/114 ET20|4|5013562|16/12/29|6|554981
Use a filter
when configuring the CSV Format. Read how to create a JavaScript code for the filter.
Example of a filter
Let's assume that the source file has a structure like the one above. Let's also assume that we need to read the CSV file and split it on a "header" which contains line number 2:
XYZ 2|LLL001|MMM0124386|BULK|17/01/11|0000002205|TIGER WHEEL|SHOPPING CENTRE BRITS NORTH WEST|0250|0122527777|2|
and "items," from line 3 downward look like this:
77777PYT12P515012MBK|TEST 9.0X17 5/150 ET12|4|5013562|16/12/29|3|554981
77777ROG20P611466MBK|TEST 9.0X17 6/114 ET20|4|5013562|16/12/29|6|554981
The end goal is to parse the source file and store the "header" part of the file in an order
table in the database and the "items" part in an item
table.
To parse files like the one above, we will create two different CSV Formats, one for the "header" and one for the "items."
Step 1. Create a Format for the "header":
Delimiter
: | (pipe).Use first Row for Data
: if selected, it tells the Etlworks Integrator that there is no row with column names anywhere, so it must use the default column names:field1
,field2
, etc.Skip First Row
: if selected, it tells the Etlworks Integrator that the first row, containing the wordHeaderInfo
, must be skipped as if it never existed.Transformation type
: selectFilter
(default).Filter or Preprocessor
: the JavaScript code below tells the Etlworks Integrator tostop
parsing the file after line 1.
Since the first line is completely ignored, the real first line starts from the former line 2: XYZ 2...
-
if (filter.getIndex() > 1) { filter.setCode(TextFilter.TextFilterCode.STOP); }
Step 2. Create a Format for the "items":
Delimiter
: | (pipe).Use first Row for Data
: if selected, it tells the Etlworks Integrator that there is no row with column names anywhere, so it must use the default column names:field1
,field2
, etc.Skip First Row
: if selected, it tells the Etlworks Integrator that the first row containing the wordHeaderInfo
, must be skipped as if it never existed.Filter
: the JavaScript code below tells the Etlworks Integrator toskip
(REJECT) the first line.
Since the first line is completely ignored, the real first line starts from the former line 2: XYZ 2...
-
if (filter.getActualIndex() == 1) { filter.setCode(TextFilter.TextFilterCode.REJECT); filter.setIndex(1); }
Comments
0 comments
Please sign in to leave a comment.