When to use this Format
CSV (or comma-separated values) is one of the most commonly used data exchange Formats. In Etlworks, you can actually define what character is used as a separator between values and lines, as well as other parameters.
Use CSV Format when configuring a source-to-destination transformation that reads or writes CSV files.
Here are some additional uses cases:
- Loading data into Snowflake
- Loading data into Amazon Redshift
- Loading data into Azure Synapse Analytics
- Loading data into Google BigQueue
- Loading data into Greenplum
- Bulk loading data into a database
Create Format
To create a new CSV Format, go to Connections, select the Formats tab, click Add Format, and type in csv in the Search field.
Parameters
- Delimiter: a character used as a delimiter between values. The default is the comma:,.
- Enclosure Character: a character used to enclose the field value. By default, only the values that contain Delimiter will be enclosed. Enable Always enclose to enclose all values.
- Always enclose: if this option is enabled, the system will always enclose fields in quotes (assuming that the Enclosure Character is configured). The default behavior is to enclose fields only if they contain the delimiter character.
- Enclose header: if this option is enabled, the system will enclose fields names in the header in quotes (assuming that the Enclosure Character is configured). This option is disabled by default.
- Escape double-quotes: if double-quotes are used to enclose fields, then a double-quote appearing inside a field will be escaped by preceding it with another double quote. This parameter is used only when creating the CSV file.
- Escape Character: this character is used to parse files with escape enclosing quotes. If not entered then a double-quote appearing inside a field is assumed to be escaped by preceding it with another double quote.
- Line Separator: a character used as a separator between lines.
- Default Extension: the default extension is used when the file name doesn't have an extension. If not entered, dat is the default extension.
- Has multiline records: enable this if you expect to have multi-line records in the source document. It only works if Quote is set.
- Ignore Byte order mark character (BOM): uncheck this option if you do not want the parser to ignore the byte order mark character (BOM). The default is ignore.
- BOM size: the size of the BOM character. If Ignore Byte order mark character (BOM) is enabled for the format, the connector will attempt to detect the BOM character in the first line, and if it exists, remove it by executing line.substring(bom_char_size). The actual size of the BOM character depends on the file encoding but in some cases, the wrong BOM is used with an otherwise correctly encoded file (very common for UTF-8 encoded files), so the connector truncates more than it should. If you know the size of the BOM character, set BOM size to the positive value equal to the size of the BOM. The most common size for UTF-8 encoded files is 1.
- Template: a template in the CSV Format. If this field is not empty, Etlworks will use it to populate column names and data types. It is an optional field. Example: first_name,last_name,dob.
- Column names compatible with SQL: this converts column names to SQL compatible column names by removing all characters except alphanumeric and spaces.
- Noname Column: the name of the column when the file does not have a header row. The column name will be the value of the Noname Column, + column index. Example: column,column2 or field1,field2.
- Use First Row for Data: if checked, it is assumed that the file doesn't have a fields header.
- Skip First Row: if this option is enabled, the system skips the first row in a file. Typically, this is used together with Use First Row for Data.
- Skip not Properly Formatted Rows: if this option is enabled, the system skips rows which do not conform to the CSV Format. For example, a row might have a different number of columns than other rows do.
- Skip Empty Rows: sometimes CSV files contain completely empty rows with no values and delimiters. Etlworks can be configured to skip these rows. Otherwise, it will generate an exception when reading such a file.
- Skip rows with fewer columns than header: if this option is enabled, the system skips rows that have fewer columns than a header row.
- Skip rows with more columns than header: if this option is enabled, the system skips rows that have more columns than a header row.
- Document has extra data columns: if this option is enabled, Etlworks will be able to read CSV documents, even if the number of header columns is less than the number of data columns.
- Enforce number of data columns: if this option is enabled, the connector will enforce the number of data columns by setting it to the same value as the number of fields in the CSV header. Use it if the number of fields in the header is less than the number of data columns and you only want to parse the data columns which have the corresponding field in the header.
- Start row: if this value is not empty, the system will start reading the file from the specified 1-based row and will ignore previous rows.
- End row: if this value is not empty, the system will stop reading the file after the specified 1-based row.
- Transformation type: the default is filter, which allows you to write JavaScript code to filter out some of the rows in a CSV file, but you can also select preprocessor or header. Read more about using the preprocessor to modify the contents of the source document. If header is selected and Filter or Preprocessor or Header is not empty, the header will be added at the beginning of the file, followed by an end-of-line character, followed by the actual data.
- Filter or Preprocessor or Header: this is code written in JavaScript, which is used to filter out rows, change rows, or the entire document. Read more about filters. If header is selected for Transformation type, the specified header will be added at the beginning of the file, followed by an end-of-line character, followed by the actual data.
- All fields are strings: if this option is enabled (it is disabled by default) the system will create all fields with a string data type. Otherwise, it will parse the field's value and attempt to detect the data type.
- Save Metadata: if this option is enabled the system will create an XML file with the same name as the CSV file. The XML file contains information about the actual data types of the columns as it was detected during the extract from the database. If present this information will be used during the load to set the data types of the columns in the destination. Enable this option if you want to preserve the exact data types when extracting data from a database and creating the CSV files. Read more about enabling this option for the CDC connector.
- Date and Time Format: a Format for timestamps (date & time).
- Date Format: a Format for date (date only, no time).
- Time Format: a Format for time (time only, no date).
- Parse Dates: if this option is enabled, and the date or time value is not recognized as one of the Formats defined above, Etlworks will try to parse it using one of the well-known date & time Formats.
- Trim Strings: if this option is enabled, Etlworks will trim leading and trailing white-spaces from the value.
- Treat 'null' as null: if this option is enabled, Etlworks will treat string values equal to null as actual nulls (no value).
- Value for null: a string that will be used instead of a null value. A typical usage example is setting Value for null to \N so the Redshift COPY command can differentiate between an empty string and NULL value. Read more about using this option to differentiate between SQL NULL and empty string (").
- Encode CLOB fields using Base64: if this option is enabled (default), Etlworks will encode fields with the CLOB data type (large TEXT fields) using the Base64 algorithm.
- C: if this option is enabled system will convert the value of the column with the BOOLEAN or BIT data type from true/ false to numeric 1/ 0 when creating the file.
- Remove EOL characters: if this option is enabled (default), the system will remove end-of-line (EOL) characters (\n, \r) from the field's value when creating a file.
- Reorder columns based on the order of columns in mapping: when this option is enabled (it is disabled by default) the CSV connector is able to create CSV files with the specific order of columns. Simply enable this option for the destination CSV Format and configure the order of fields in mapping.
- S: if this option is enabled (it is disabled by default), the system will strip not printable characters, such as null-character (/0), from the data row when creating a file.
- Maximum number of rows in file: the maximum number of rows in the file when creating new CSV files. Use it to split a large CSV document while creating it. It is extremely fast and efficient, compared to splitting the existing document.
- Encoding: character encoding when reading and writing CSV files. No encoding means there will be no additional encoding.
Preserve order of columns as in mapping when creating CSV file
By default when the flow creates a CSV file and there is user-defined mapping it keeps the original order of columns in the source and adds columns which do not exist in the source to the end, therefore ignoring the order of columns in the mapping.
To preserve order of columns as in mapping enable Reorder columns based on the order of columns in mapping.
Read non-standard CSV files
It is possible that your source CSV file is not properly formatted, does not conform to the CSV Format standard, or has other issues.
In Etlworks, you can modify the properties of the CSV Format so it could be parsed regardless of how broken or non-standard it is.
Read more.
Global find and replace when creating CSV files
The Etlworks CSV connector generates CSV files line by line, with each column value stored as a string. You can perform a global find and replace operation within these string values by configuring the following parameters:
- : the regular expression in Java format to match and replace a string when creating a file.
- : the string used to replace the text matched by the regex. If this parameter is not set by the find regex is configured the connectors sets it to to empty string ("").
For example if you want to remove all single quotes which are not at the beginning and not at the end of the string (like ) you can set find regex to and keep the replacement parameter empty.
Using an example above the string will be transformed to .
Filter rows in a flat file
Example
It is common to need to filter out rows in a file. For example, the Excel worksheet below has four lines that can be considered headers, but only the first is an actual header. There is a way to filter out some of the rows in a file.
Here are the steps to do this:
Step 1. Create an ETL Flow where the source is a file.
Step 2. Use filter transformation to filter out rows in a file.
For the example above, the filter would need to be set to:
row == 0 || row > 3
Read non-standard CSV files
In Etlworks, you can modify the properties of the CSV Format so it could be parsed regardless of how broken or non-standard it is.
Use cases
CSV file with empty lines
Example:
first,last,age
john,doe,23
...empty line...
mad,max,45
When configuring the CSV Format, select the Skip Empty Rows check box.
The header does not conform to the CSV Format
just a header
john,doe,23
mad,max,45
When configuring the CSV Format, select the Skip First Row check box and also (typically) Use First Row for Data.
Some of the lines in the file have a different number of columns
first,last,age
john,doe,23
wrong,number,of,columns
mad,max,45
When configuring the CSV Format, select the Skip not Properly Formatted Rows check box.
The header has fewer columns than the data rows
first,last
john,doe,23,01/01/2000
mad,max,45,01/03/1977
When configuring CSV Format, select the Document has extra data columns check box and deselect Use First Row for Data.
Some of the data rows have a fewer or greater number of columns than a header
first,last,age
joe,doe,42
simba,dog
jane,doe,26,Pittsburgh
When configuring the CSV Format, enable one or both options below:
- Skip rows with fewer columns than header: if this option is enabled, the system skips rows that have fewer columns than a header row.
- Skip rows with more columns than header: if this option is enabled, the system skips rows that have more columns than a header row.
Reading CSV files with multiline records
Your source CSV file may have records (rows) that span multiple lines.
first,last,description,dob
Joe,Doe,"just a
guy","0102000"
Brad,Pit,"an actor","0101965"
As you can see, for some of the records, the value of the Description field occupies more than one line.
To handle this, enable Has multiline records when configuring the CSV Format.
CSV file has non-standard BOM character
The special character known as the byte order marking (BOM) character is commonly found as the first line of a CSV file.
If Ignore Byte order mark character (BOM) is enabled for the CSV Format, the connector will attempt to detect the BOM character in the first line, and if it exists, remove it by executing line.substring(bom_char_size). The actual size of the BOM character depends on the file encoding but in some cases, the wrong BOM is used with an otherwise correctly encoded file (very common for UTF-8 encoded files), so the connector truncates more than it should. If you know the size of the BOM character, set BOM size to the positive value equal to the size of the BOM. The most common size for UTF-8 encoded files is 1.
CSV file with non standard character used to escape double quotes
Some CSV files may use a non standard character to escape enclosing double quotes, for example \:
"Title","Author","Year"
"The Catcher in the Rye, Special Edition","J.D. Salinger","1951"
"\"To Kill a Mockingbird\" by Harper Lee","Harper Lee","1960"
"1984","George Orwell","1949"
You can configure the escape character for the specific CSV format. This parameter is only used when parsing the CSV or Fixed length text file.
If this parameter is not set then a double-quote appearing inside a string is assumed to be escaped by preceding it with another double quote.
The CSV file has a completely irregular structure and cannot be parsed by a standard parser
In the example below, the first line must be ignored, the second line contains a "header" with one column, while lines from the third downward contain "items" with a different number of columns:
HeaderInfo
XYZ 2|LLL001|MMM0124386|BULK|17/01/11|0000002205|TIGER WHEEL|SHOPPING CENTRE BRITS NORTH WEST|0250|0122527777|2|
77777PYT12P515012MBK|TEST 9.0X17 5/150 ET12|4|5013562|16/12/29|3|554981
77777ROG20P611466MBK|TEST 9.0X17 6/114 ET20|4|5013562|16/12/29|6|554981
Use a filter when configuring the CSV Format. Read how to create a JavaScript code for the filter.
Example of a filter
Let's assume that the source file has a structure like the one above. Let's also assume that we need to read the CSV file and split it on a "header" which contains line number 2:
XYZ 2|LLL001|MMM0124386|BULK|17/01/11|0000002205|TIGER WHEEL|SHOPPING CENTRE BRITS NORTH WEST|0250|0122527777|2|
and "items," from line 3 downward look like this:
77777PYT12P515012MBK|TEST 9.0X17 5/150 ET12|4|5013562|16/12/29|3|554981
77777ROG20P611466MBK|TEST 9.0X17 6/114 ET20|4|5013562|16/12/29|6|554981
The end goal is to parse the source file and store the "header" part of the file in an order table in the database and the "items" part in an item table.
To parse files like the one above, we will create two different CSV Formats, one for the "header" and one for the "items."
Step 1. Create a Format for the "header":
- Delimiter: | (pipe).
- Use first Row for Data: if selected, it tells Etlworks that there is no row with column names anywhere, so it must use the default column names: field1, field2, etc.
- Skip First Row: if selected, it tells Etlworks that the first row, containing the word HeaderInfo, must be skipped as if it never existed.
- Transformation type: select Filter (default).
- Filter or Preprocessor: the JavaScript code below tells Etlworks to stop parsing the file after line 1.
Note: Since the first line is completely ignored, the real first line starts from the former line 2: XYZ 2...
-
if (filter.getIndex() > 1) { filter.setCode(TextFilter.TextFilterCode.STOP); }
Step 2. Create a Format for the "items":
- Delimiter: | (pipe).
- Use first Row for Data: if selected, it tells Etlworks that there is no row with column names anywhere, so it must use the default column names: field1, field2, etc.
- Skip First Row: if selected, it tells Etlworks that the first row containing the word HeaderInfo, must be skipped as if it never existed.
- Filter: the JavaScript code below tells Etlworks to skip (REJECT) the first line.
Note: Since the first line is completely ignored, the real first line starts from the former line 2: XYZ 2...
-
if (filter.getActualIndex() == 1) { filter.setCode(TextFilter.TextFilterCode.REJECT); filter.setIndex(1); }
Preserve exact data types when extracting from a database to CSV
Example
When Etlworks Flow creates a CSV file, by default, the information about the column's data types in the source database is getting lost. So if another Flow reads the same file, it assigns the data types to the columns by sampling the data, which is not always 100% accurate.
If you want to preserve the exact data types, enable the option Save Metadata for the source and the destination CSV Formats. If this option is enabled, the system will store column names and data types in the XML file with the same name and in the same location as the original CSV file.
When to enable this option
Generally speaking, it makes sense to enable this option if the CSV file is created by Etlworks Flow that extracts data from a database and then used by another Etlworks Flow to load data into a different database.
Split a CSV file when creating it
Typically, it makes sense only for large documents (millions of rows) because it can greatly improve the performance of certain data load tasks, such as loading data into the online data warehouses: Snowflake, Redshift, etc. These systems can load files in parallel. Transferring smaller files over the network can often be faster.
To always split the CSV file when creating it, create a new CSV Format and enter a positive value in the field Maximum number of rows in file. Then use the created Format in the TO part of the source-to-destination transformation.
Add a custom header to a CSV file
Example
You need to create a CSV file with a custom header to submit data into Amazon MWS.
When configuring the CSV Format, set the Transformation type to header and specify the Header parameter.
Enable base64 encoding of fields in CSV files
Overview
When the Flow extracts data from the source and creates CSV files, the columns with the CLOB (TEXT, LONGTEXT, etc.) data types can be converted to the binary string encoded using base64 algorithm. The encoded strings looks like the following: H4sIAAAAAAAAAAXByxGCMBAAUC6WEwL5YPBiKZmwLLIMbJhkHc.
If Base64 encoding is enabled, Etlworks also automatically encodes nested datasets.
If you want Etlworks to encode CLOB fields and nested datasets enable option Encode CLOB fields using Base64 when creating or modifying CSV format.
Keep this option disabled if you want Etlworks to preserve the original string value like it is stored in the source. This option is disabled by default.