When to use this Format
Use HTML Format when configuring a source-to-destination transformation that reads or writes HTML files.
How it works
When the HTML connector reads a document in HTML format, it attempts to identify and parse tables within the document. If there is only one unique table, the connector will generate a single, flat dataset. However, if multiple unique tables are detected, the connector will create a nested document, where each table becomes a column in a dataset with a single row. This structure allows for easier handling of multiple tables within a single document.
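The reading behavior described above can be sketched in Python using only the standard library. This is a minimal illustration, not the connector's actual implementation: the names `TableCollector`, `to_records`, and `read_html`, the `table1`, `table2` column names, and the choice to treat the first row of each table as column names are all assumptions. The sketch also ignores nested tables and does not check whether tables are unique.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every <table> in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def to_records(rows):
    """Assumption: the first row holds column names, the rest hold data."""
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


def read_html(html):
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    tables = [to_records(t) for t in parser.tables if t]
    if not tables:
        return []
    if len(tables) == 1:
        return tables[0]  # one table: a single, flat dataset
    # several tables: one row in which each table is a column
    return [{f"table{i + 1}": t for i, t in enumerate(tables)}]
```

With one table in the input, `read_html` returns a flat list of records; with two or more, it returns a single-row list whose columns hold the individual tables.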
When writing data to an HTML file, the connector creates an HTML <table> element to represent the data. Additionally, it automatically generates a header with inline CSS to enhance the look and feel of the table, ensuring better readability and presentation in a web browser.
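As a rough illustration of the writing side, the following Python sketch renders a list of records as an HTML table preceded by a styled header. The function name `write_html`, the constant `STYLED_HEADER`, and the CSS rules are hypothetical; the markup the connector actually generates may differ.

```python
from html import escape

# Hypothetical styled header; the CSS rules are illustrative only.
STYLED_HEADER = (
    "<html><head><style>"
    "table{border-collapse:collapse}"
    "th,td{border:1px solid #ccc;padding:4px}"
    "</style></head><body><table>"
)


def write_html(records, header=STYLED_HEADER):
    """Render a list of dicts as an HTML document containing one table."""
    columns = list(records[0]) if records else []
    lines = [header]
    # Header row built from the column names
    lines.append(
        "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in columns) + "</tr>"
    )
    # One <tr> per record, with values escaped for safe display
    for rec in records:
        cells = "".join(
            f"<td>{escape(str(rec.get(c, '')))}</td>" for c in columns
        )
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)
```

Passing `header="<html><body><table>"` would mimic the basic header that the Header property falls back to when it is left empty.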
Developers can also use the Jsoup library, from either JavaScript or Python code, to parse and manipulate HTML documents. For more information on using Jsoup to parse HTML documents, refer to this article.
Create Format
To create a new HTML Format, go to Connections, select the Formats tab, click Add Format, and type in html in the Search field.
Properties
- Header: Use this header to create a complete HTML document with customized styles. If this parameter is left empty, the connector defaults to the basic header: <html><body><table>.
- Column names compatible with SQL: Converts column names to be SQL-compliant by removing any characters that are not alphanumeric or spaces. The resulting column names are suitable for use in SQL queries and table definitions.
- Skip Empty Rows: HTML files may occasionally contain completely empty rows without any values or delimiters. Enabling this option allows Etlworks to skip such rows during processing. If it is disabled, an empty row causes an exception when the file is read.
- Treat 'null' as null: When this option is enabled, Etlworks interprets string values that are literally "null" as actual null values, meaning no value is present. When it is disabled, the string "null" is kept as text.
- Date and Time Format: Specifies the format for timestamps that include both date and time values. This format is used to parse and format timestamp fields.
- Date: Specifies the format for values that include a date only. This format is used to parse and format date fields.
- Time: Specifies the format for values that include a time only. This format is used to parse and format time fields.
- Encoding: The character encoding. No Encoding means there will be no additional encoding.
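
Three of the properties above can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not Etlworks code: `sql_column_name` and `clean_rows` are hypothetical names, and the sketch assumes that "alphanumeric" means ASCII letters and digits, and that a row counts as empty when every value is missing or blank.

```python
import re


def sql_column_name(name):
    """Column names compatible with SQL.

    Assumption: keep ASCII letters, digits, and spaces; drop everything else.
    """
    return re.sub(r"[^A-Za-z0-9 ]", "", name).strip()


def clean_rows(rows, skip_empty_rows=True, treat_null_as_null=True):
    """Apply the Skip Empty Rows and Treat 'null' as null options to rows."""
    cleaned = []
    for row in rows:
        # Assumption: a row is empty when every value is missing or blank.
        is_empty = all(v is None or str(v).strip() == "" for v in row)
        if is_empty:
            if skip_empty_rows:
                continue
            raise ValueError("empty row encountered while reading the file")
        if treat_null_as_null:
            # The literal string "null" becomes a real null (None).
            row = [None if v == "null" else v for v in row]
        cleaned.append(row)
    return cleaned
```

Calling `clean_rows` with `skip_empty_rows=False` raises an error on the first empty row, which mirrors the exception described for the disabled option.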