When to use this Format
CSV (or comma-separated values) is one of the most commonly used data exchange Formats. In Etlworks, you can define which character is used as a separator between values and lines, as well as other parameters.
Use CSV Format when configuring a source-to-destination transformation that reads or writes CSV files.
Here are some additional use cases:
- Loading data into Snowflake
- Loading data into Amazon Redshift
- Loading data into Azure Synapse Analytics
- Loading data into Google BigQuery
- Loading data into Greenplum
- Bulk loading data into a database
Create Format
To create a new CSV Format, go to Connections, select the Formats tab, click Add Format, and type in csv in the Search field.
Parameters
- Delimiter: a character used as a delimiter between values. The default is the comma (`,`).
- Enclosure Character: a character used to enclose the field value. By default, only the values that contain the Delimiter will be enclosed. Enable Always enclose to enclose all values.
- Always enclose: if this option is enabled, the system will always enclose fields in quotes (assuming that the Enclosure Character is configured). The default behavior is to enclose fields only if they contain the delimiter character.
- Enclose header: if this option is enabled, the system will enclose field names in the header in quotes (assuming that the Enclosure Character is configured). This option is disabled by default.
- Escape double-quotes: if double-quotes are used to enclose fields, then a double-quote appearing inside a field will be escaped by preceding it with another double-quote. This parameter is used only when creating the CSV file.
- Escape Character: this character is used to parse files with escaped enclosing quotes. If not entered, then a double-quote appearing inside a field is assumed to be escaped by preceding it with another double-quote.
- Line Separator: a character used as a separator between lines.
- Default Extension: the default extension is used when the file name doesn't have an extension. If not entered, `dat` is the default extension.
- Has multiline records: enable this if you expect to have multi-line records in the source document. It only works if Quote is set.
- Ignore Byte order mark character (BOM): uncheck this option if you do not want the parser to ignore the byte order mark character (BOM). The default is to ignore it.
- BOM size: the size of the BOM character. If Ignore Byte order mark character (BOM) is enabled for the Format, the connector will attempt to detect the BOM character in the first line and, if it exists, remove it by executing `line.substring(bom_char_size)`. The actual size of the BOM character depends on the file encoding, but in some cases the wrong BOM is used with an otherwise correctly encoded file (very common for UTF-8 encoded files), so the connector truncates more than it should. If you know the size of the BOM character, set BOM size to the positive value equal to the size of the BOM. The most common size for UTF-8 encoded files is 1 (see the sketch after this list).
- Template: a template in the CSV Format. If this field is not empty, Etlworks will use it to populate column names and data types. It is an optional field. Example: `first_name,last_name,dob`.
- Column names compatible with SQL: converts column names to SQL-compatible column names by removing all characters except alphanumeric characters and spaces.
- Noname Column: the name of the column when the file does not have a header row. The column name will be the value of Noname Column + column index. Example: `column1,column2` or `field1,field2`.
- Use First Row for Data: if checked, it is assumed that the file doesn't have a header row with field names.
- Skip First Row: if this option is enabled, the system skips the first row in a file. Typically, this is used together with Use First Row for Data.
- Skip not Properly Formatted Rows: if this option is enabled, the system skips rows which do not conform to the CSV Format. For example, a row might have a different number of columns than other rows do.
- Skip Empty Rows: sometimes CSV files contain completely empty rows, with no values and no delimiters. Etlworks can be configured to skip these rows; otherwise, it will generate an exception when reading such a file.
- Skip rows with fewer columns than header: if this option is enabled, the system skips rows that have fewer columns than the header row.
- Skip rows with more columns than header: if this option is enabled, the system skips rows that have more columns than the header row.
- Document has extra data columns: if this option is enabled, Etlworks will be able to read CSV documents even if the number of header columns is less than the number of data columns.
- Enforce number of data columns: if this option is enabled, the connector will enforce the number of data columns by setting it to the number of fields in the CSV header. Use it if the number of fields in the header is less than the number of data columns and you only want to parse the data columns which have a corresponding field in the header.
- Start row: if this value is not empty, the system will start reading the file from the specified 1-based row and will ignore previous rows.
- End row: if this value is not empty, the system will stop reading the file after the specified 1-based row.
- Transformation type: the default is `filter`, which allows you to write JavaScript code to filter out some of the rows in a CSV file, but you can also select `preprocessor` or `header`. Read more about using the preprocessor to modify the contents of the source document. If `header` is selected and Filter or Preprocessor or Header is not empty, the header will be added at the beginning of the file, followed by an end-of-line character, followed by the actual data.
- Filter or Preprocessor or Header: code written in JavaScript, which is used to filter out rows, change rows, or change the entire document (see the sketch after this list). Read more about filters. If `header` is selected for Transformation type, the specified header will be added at the beginning of the file, followed by an end-of-line character, followed by the actual data.
- All fields are strings: if this option is enabled (it is disabled by default), the system will create all fields with a string data type. Otherwise, it will parse each field's value and attempt to detect the data type.
- Save Metadata: if this option is enabled, the system will create an XML file with the same name as the CSV file. The XML file contains information about the actual data types of the columns, as detected during the extract from the database. If present, this information will be used during the load to set the data types of the columns in the destination. Enable this option if you want to preserve the exact data types when extracting data from a database and creating CSV files. Read more about enabling this option for the CDC connector.
- Date and Time Format: a Format for timestamps (date and time).
- Date Format: a Format for dates (date only, no time).
- Time Format: a Format for times (time only, no date).
- Parse Dates: if this option is enabled and a date or time value is not recognized as one of the Formats defined above, Etlworks will try to parse it using one of the well-known date and time Formats.
- Trim Strings: if this option is enabled, Etlworks will trim leading and trailing white space from the value.
- Treat 'null' as null: if this option is enabled, Etlworks will treat string values equal to `null` as actual nulls (no value).
- Value for null: a string that will be used instead of a null value. A typical usage example is setting Value for null to `\N` so the Redshift `COPY` command can differentiate between an empty string and a `NULL` value (see the example after this list). Read more about using this option to differentiate between SQL `NULL` and an empty string (`""`).
- Encode CLOB fields using Base64: if this option is enabled (default), Etlworks will encode fields with the CLOB data type (large TEXT fields) using the Base64 algorithm.
- Convert boolean to number: if this option is enabled, the system will convert the value of a column with the `BOOLEAN` or `BIT` data type from `true`/`false` to numeric `1`/`0` when creating the file.
- Remove EOL characters: if this option is enabled (default), the system will remove end-of-line (EOL) characters (`\n`, `\r`) from the field's value when creating a file.
- Reorder columns based on the order of columns in mapping: when this option is enabled (it is disabled by default), the CSV connector can create CSV files with a specific order of columns. Simply enable this option for the destination CSV Format and configure the order of fields in the mapping.
- Strip not printable characters: if this option is enabled (it is disabled by default), the system will strip not-printable characters, such as the null character (`\0`), from the data row when creating a file.
- Maximum number of rows in file: the maximum number of rows per file when creating new CSV files. Use it to split a large CSV document while creating it. This is extremely fast and efficient compared to splitting an existing document.
- Encoding: character encoding when reading and writing CSV files. `No encoding` means there will be no additional encoding.
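To make the BOM size parameter concrete, here is a minimal sketch of the stripping logic described above. It is an illustration only, not the Etlworks implementation: it assumes the decoded first line is available as a JavaScript string, where a UTF-8 BOM appears as the single character U+FEFF (which is why the most common BOM size for UTF-8 files is 1).

```javascript
// Illustrative sketch of the BOM handling described for the BOM size parameter.
// After decoding, a UTF-8 BOM shows up as one character (U+FEFF), so removing
// it mirrors the documented line.substring(bom_char_size) call with size 1.
function stripBom(line, bomSize) {
  if (line.length > 0 && line.charCodeAt(0) === 0xFEFF) {
    return line.substring(bomSize);
  }
  return line;
}

stripBom('\uFEFFfirst_name,last_name,dob', 1); // -> 'first_name,last_name,dob'
```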
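As an illustration of the `filter` transformation type, the sketch below keeps only rows whose `status` column equals `active`. The `keepRow` function and the plain-object row are placeholders, not the actual variables Etlworks exposes to the script; see the linked filters article for those.

```javascript
// Hypothetical row filter: return true to keep a row, false to drop it.
// A row is modeled here as a plain object keyed by column name; the real
// filter code runs inside Etlworks with its own variables.
function keepRow(row) {
  return row.status != null && String(row.status).toUpperCase() === 'ACTIVE';
}

keepRow({ id: 1, status: 'Active' });  // -> true  (row is kept)
keepRow({ id: 2, status: 'deleted' }); // -> false (row is filtered out)
```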
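To illustrate Value for null, here is what a created file could look like with Value for null set to `\N`: the first data row carries an empty string, while the second carries a real NULL that a loader such as the Redshift COPY command (configured with a matching NULL AS '\N' option) can tell apart from the empty string.

```
id,nickname
1,""
2,\N
```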
Preserve order of columns as in mapping when creating CSV file
By default, when the flow creates a CSV file and there is a user-defined mapping, it keeps the original order of columns in the source and adds columns which do not exist in the source to the end, therefore ignoring the order of columns in the mapping.
To preserve the order of columns as in the mapping, enable Reorder columns based on the order of columns in mapping.
Read non-standard CSV files
It is possible that your source CSV file is not properly formatted, does not conform to the CSV Format standard, or has other issues.
In Etlworks, you can modify the properties of the CSV Format so the file can be parsed regardless of how broken or non-standard it is.
Global find and replace when creating CSV files
The Etlworks CSV connector generates CSV files line by line, with each column value stored as a string. You can perform a global find and replace operation within these string values by configuring the following parameters:
- Find regex: the regular expression, in Java format, to match and replace a string when creating a file.
- Replacement: the string used to replace the text matched by the find regex. If this parameter is not set but the find regex is configured, the connector sets it to an empty string ("").

For example, if you want to remove all single quotes which are not at the beginning and not at the end of the string, set the find regex to a pattern that matches only the inner quotes and keep the replacement parameter empty.
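As an illustration (this exact pattern is an assumption, not a value taken from the product documentation), one Java-format regex that implements the rule above is `(?<!^)'(?!$)`: a single quote that is neither at the start nor at the end of the string. The snippet below sanity-checks the pattern; it is written in JavaScript, whose lookaround syntax matches Java's for this pattern.

```javascript
// Match a single quote that is not at the start (negative lookbehind)
// and not at the end (negative lookahead) of the value.
var findRegex = /(?<!^)'(?!$)/g;

"'don't'".replace(findRegex, ''); // -> "'dont'": edge quotes kept, inner one removed
```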