Overview
Your source CSV file may be improperly formatted, may not conform to the CSV standard, or may have other issues.
In Etlworks, you can adjust the properties of the CSV Format so that the file can be parsed regardless of how broken or non-standard it is.
Use cases
CSV file with empty lines
Example:
first,last,age
john,doe,23
...empty line...
mad,max,45
When configuring the CSV Format, select the Skip Empty Rows check box.
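The effect of this option can be sketched in plain JavaScript. Here, skipEmptyRows is a hypothetical helper for illustration, not Etlworks code: blank or whitespace-only lines are dropped before the remaining lines are parsed.

```javascript
// Drop lines that are empty or contain only whitespace,
// leaving the rest of the file unchanged.
function skipEmptyRows(csvText) {
  return csvText
    .split(/\r?\n/)
    .filter(function (line) { return line.trim().length > 0; })
    .join('\n');
}
```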
The header does not conform to the CSV Format
just a header
john,doe,23
mad,max,45
When configuring the CSV Format, select the Skip First Row check box and also (typically) Use First Row for Data.
Some of the lines in the file have a different number of columns
first,last,age
john,doe,23
wrong,number,of,columns
mad,max,45
When configuring the CSV Format, select the Skip not Properly Formatted Rows check box.
The header has fewer columns than the data rows
first,last
john,doe,23,01/01/2000
mad,max,45,01/03/1977
When configuring the CSV Format, select the Document has extra data columns check box and deselect Use First Row for Data.
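One plausible way to picture this option: when a data row has more values than the header, default names are generated for the unlabeled columns. The sketch below assumes Etlworks-style default names (field1, field2, ...); buildColumnNames is a hypothetical helper, and the actual connector behavior may differ.

```javascript
// Given the header names and the length of a data row, extend the
// column list with generated names for any extra columns.
function buildColumnNames(header, rowLength) {
  var names = header.slice();
  for (var i = names.length; i < rowLength; i++) {
    names.push('field' + (i + 1));   // e.g. field3, field4, ...
  }
  return names;
}
```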
Some of the data rows have fewer or more columns than the header
first,last,age
joe,doe,42
simba,dog
jane,doe,26,Pittsburgh
When configuring the CSV Format, enable one or both options below:
Skip rows with fewer columns than header: if this option is enabled, the system skips rows that have fewer columns than the header row.
Skip rows with more columns than header: if this option is enabled, the system skips rows that have more columns than the header row.
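Conceptually, the two options amount to comparing each data row's column count against the header. The sketch below is illustrative only (filterByColumnCount is a hypothetical helper, and it naively splits on commas, ignoring quoting):

```javascript
// Keep the header row; drop data rows whose column count disagrees
// with the header, depending on which checks are enabled.
function filterByColumnCount(rows, skipFewer, skipMore) {
  var headerCols = rows[0].split(',').length;
  return rows.filter(function (row, i) {
    if (i === 0) return true;                 // always keep the header
    var cols = row.split(',').length;
    if (skipFewer && cols < headerCols) return false;
    if (skipMore && cols > headerCols) return false;
    return true;
  });
}
```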
Reading CSV files with multiline records
Your source CSV file may have records (rows) that span multiple lines.
first,last,description,dob
Joe,Doe,"just a
guy","0102000"
Brad,Pit,"an actor","0101965"
As you can see, for some of the records, the value of the Description field occupies more than one line.
To handle this, enable Has multiline records when configuring the CSV Format.
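One common way a parser can detect multiline records is to end a record only when the accumulated text leaves it outside a quoted value, i.e. when it contains an even number of double quotes. The sketch below illustrates the idea; joinMultilineRecords is a hypothetical helper, and it assumes the default doubled-quote escaping:

```javascript
// Merge physical lines into logical records: while the number of
// double quotes seen so far is odd, we are still inside a quoted
// value, so the record continues on the next line.
function joinMultilineRecords(lines) {
  var records = [];
  var buffer = '';
  for (var i = 0; i < lines.length; i++) {
    buffer = buffer === '' ? lines[i] : buffer + '\n' + lines[i];
    var quotes = (buffer.match(/"/g) || []).length;
    if (quotes % 2 === 0) {   // quotes balanced: record is complete
      records.push(buffer);
      buffer = '';
    }
  }
  return records;
}
```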
CSV file has non-standard BOM character
A special character known as the byte order mark (BOM) is commonly found at the very beginning of a CSV file.
If Ignore Byte order mark character (BOM) is enabled for the CSV Format, the connector attempts to detect the BOM character in the first line and, if it exists, removes it by executing line.substring(bom_char_size). The actual size of the BOM character depends on the file encoding, but in some cases the wrong BOM is used with an otherwise correctly encoded file (very common for UTF-8 encoded files), so the connector truncates more than it should. If you know the size of the BOM character, set BOM size to a positive value equal to the size of the BOM. The most common size for UTF-8 encoded files is 1.
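The substring-based removal described above can be sketched as follows. stripBom is a hypothetical helper, and it assumes the file has already been decoded so that a UTF-8 BOM appears as the single character U+FEFF:

```javascript
// If the first line starts with a BOM character (U+FEFF after
// decoding), remove the configured number of characters from it.
function stripBom(firstLine, bomSize) {
  if (firstLine.charCodeAt(0) === 0xFEFF) {
    return firstLine.substring(bomSize);
  }
  return firstLine;
}
```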
CSV file with a non-standard character used to escape double quotes
Some CSV files may use a non-standard character to escape enclosing double quotes, for example \:
"Title","Author","Year"
"The Catcher in the Rye, Special Edition","J.D. Salinger","1951"
"\"To Kill a Mockingbird\" by Harper Lee","Harper Lee","1960"
"1984","George Orwell","1949"
You can configure the escape character for a specific CSV Format. This parameter is only used when parsing CSV or fixed-length text files.
If this parameter is not set, a double quote appearing inside a string is assumed to be escaped by preceding it with another double quote.
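The difference between the two escaping styles can be sketched like this. unescapeQuoted is a hypothetical helper applied to an already-extracted quoted value, not Etlworks code:

```javascript
// Turn escaped quotes inside a quoted value back into literal quotes.
// With a configured escape character (e.g. \), replace escapeChar + "
// with "; otherwise apply the default rule of doubling ("" -> ").
function unescapeQuoted(value, escapeChar) {
  if (escapeChar) {
    return value.split(escapeChar + '"').join('"');
  }
  return value.split('""').join('"');
}
```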
The CSV file has a completely irregular structure and cannot be parsed by a standard parser
In the example below, the first line must be ignored, the second line contains a "header" with one column, while lines from the third downward contain "items" with a different number of columns:
HeaderInfo
XYZ 2|LLL001|MMM0124386|BULK|17/01/11|0000002205|TIGER WHEEL|SHOPPING CENTRE BRITS NORTH WEST|0250|0122527777|2|
77777PYT12P515012MBK|TEST 9.0X17 5/150 ET12|4|5013562|16/12/29|3|554981
77777ROG20P611466MBK|TEST 9.0X17 6/114 ET20|4|5013562|16/12/29|6|554981
Use a filter when configuring the CSV Format. Read how to create JavaScript code for the filter.
Example of a filter
Let's assume that the source file has a structure like the one above. Let's also assume that we need to read the CSV file and split it into a "header," which is line 2:
XYZ 2|LLL001|MMM0124386|BULK|17/01/11|0000002205|TIGER WHEEL|SHOPPING CENTRE BRITS NORTH WEST|0250|0122527777|2|
and "items," from line 3 downward look like this:
77777PYT12P515012MBK|TEST 9.0X17 5/150 ET12|4|5013562|16/12/29|3|554981
77777ROG20P611466MBK|TEST 9.0X17 6/114 ET20|4|5013562|16/12/29|6|554981
The end goal is to parse the source file and store the "header" part of the file in an order table in the database and the "items" part in an item table.
To parse files like the one above, we will create two different CSV Formats, one for the "header" and one for the "items."
Step 1. Create a Format for the "header":
Delimiter: | (pipe).
Use First Row for Data: if selected, it tells Etlworks that there is no row with column names anywhere, so it must use the default column names: field1, field2, etc.
Skip First Row: if selected, it tells Etlworks that the first row, containing the word HeaderInfo, must be skipped as if it never existed.
Transformation type: select Filter (default).
Filter or Preprocessor: the JavaScript code below tells Etlworks to stop parsing the file after line 1.
Since the first line is completely ignored, the real first line starts from the former line 2: XYZ 2...
if (filter.getIndex() > 1) { filter.setCode(TextFilter.TextFilterCode.STOP); }
Step 2. Create a Format for the "items":
Delimiter: | (pipe).
Use First Row for Data: if selected, it tells Etlworks that there is no row with column names anywhere, so it must use the default column names: field1, field2, etc.
Skip First Row: if selected, it tells Etlworks that the first row, containing the word HeaderInfo, must be skipped as if it never existed.
Filter: the JavaScript code below tells Etlworks to skip (REJECT) the first line.
Since the first line is completely ignored, the real first line starts from the former line 2: XYZ 2...
if (filter.getActualIndex() == 1) { filter.setCode(TextFilter.TextFilterCode.REJECT); filter.setIndex(1); }