HTML Format – Etlworks Support

When to use this Format

Use HTML Format when configuring a source-to-destination transformation that reads or writes HTML files.

How it works

When the HTML connector reads an HTML document, it attempts to identify and parse the tables within it. Tables with an identical structure, if any, are merged into a single table. If only one unique table remains, the connector generates a single, flat dataset. If multiple unique tables are detected, the connector creates a nested dataset with a single row, where each table becomes a column. This structure makes it easier to handle multiple tables within a single document.
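The reading behavior described above can be illustrated with a minimal Python sketch. It is not the connector's actual implementation: it uses only the standard-library html.parser module, assumes the first row of each table is its header, does not handle nested tables, and the names TableExtractor, read_html, and the table1, table2 column names are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> as a list of rows; each row is a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def read_html(html):
    """Return a flat dataset for one unique table, a nested one otherwise."""
    parser = TableExtractor()
    parser.feed(html)

    # Merge tables that share the same header (identical structure).
    merged = {}
    for table in parser.tables:
        if not table:
            continue
        header, *rows = table
        merged.setdefault(tuple(header), []).extend(
            dict(zip(header, row)) for row in rows
        )

    datasets = list(merged.values())
    if len(datasets) <= 1:
        return datasets[0] if datasets else []
    # Nested dataset: a single row with one column per unique table.
    return [{f"table{i + 1}": rows for i, rows in enumerate(datasets)}]
```

In this sketch, two tables with the same header produce one flat list of rows, while a document with two different headers produces a single row whose columns hold the individual tables.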

When writing data to an HTML file, the connector creates an HTML <table> element to represent the data. Additionally, it automatically generates a header with inline CSS to enhance the look and feel of the table, ensuring better readability and presentation in a web browser.
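As a rough illustration of the writing side, the following Python sketch renders a list of rows as an HTML table. It is an assumption-laden example, not the connector's code: the write_html function and the STYLED_HEADER value are hypothetical, and the only detail taken from this article is that the header precedes the rows and that the basic header is <html><body><table>.

```python
from html import escape

# Hypothetical styled header. The CSS shown here is illustrative only.
STYLED_HEADER = (
    "<html><head><style>"
    "table{border-collapse:collapse} th,td{border:1px solid #ccc;padding:4px}"
    "</style></head><body><table>"
)


def write_html(rows, header="<html><body><table>"):
    """Render a list of dicts as an HTML document containing one table.

    The header is expected to end with the opening <table> tag.
    """
    columns = list(rows[0]) if rows else []
    parts = [header]
    # Header row with column names.
    parts.append(
        "<tr>" + "".join(f"<th>{escape(str(c))}</th>" for c in columns) + "</tr>"
    )
    # One <tr> per data row; missing or None values become empty cells.
    for row in rows:
        cells = "".join(
            f"<td>{escape('' if row.get(c) is None else str(row.get(c)))}</td>"
            for c in columns
        )
        parts.append(f"<tr>{cells}</tr>")
    parts.append("</table></body></html>")
    return "\n".join(parts)
```

Calling write_html(rows, header=STYLED_HEADER) would produce the same table with the illustrative styles applied.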

Developers can also use the Jsoup library, from either JavaScript or Python code, to parse and manipulate HTML documents. For more information on how to use Jsoup for parsing HTML documents, refer to this article.

Create Format

To create a new HTML Format, go to Connections, select the Formats tab, click Add Format, and type html in the Search field.

Properties

Header: Use this header to create a complete HTML document with customized styles. If this parameter is left empty, the connector will default to the basic header: <html><body><table>.
Preprocessor: JavaScript code that modifies the contents of the source document. Read more about how to use a Preprocessor.
Column names compatible with SQL: This feature converts column names to be SQL-compliant by removing any characters that are not alphanumeric or spaces. The resulting column names are formatted to ensure compatibility with SQL databases, making them suitable for use in queries and table definitions.
Noname column: Specifies the default column name when an actual name is not present in the document, such as when the file lacks a header row. The generated column name follows the pattern: noname.column.name + column index. Example: column1, column2 or field1, field2.
Skip Empty Rows: HTML files may occasionally contain completely empty rows without any values. Enabling this option allows Etlworks to skip such rows during processing. If disabled, encountering an empty row will result in an exception when reading the file.
Treat 'null' as null: When this option is enabled, Etlworks will interpret string values that are literally "null" as actual null values, meaning no value is present. This helps differentiate between the string "null" and a true null in your data.
All fields are strings: By default, the connector determines column data types by sampling the data. Enable this setting to force all columns to be treated as strings (VARCHAR).
Date and Time Format: Specifies the format for timestamps that include both date and time values. This format will be used to parse and format timestamp fields throughout the system.
Date: Specifies the format for timestamps that include date values only. This format will be used to parse and format date fields throughout the system.
Time: Specifies the format for timestamps that include time values only. This format will be used to parse and format time fields throughout the system.
Parse Dates: If a Date or Time value is not recognized using the default formats, the system will attempt to parse it automatically and determine the appropriate format.
Encoding: The character encoding of the document. No Encoding means that no additional encoding is applied.
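To make the column-naming and null-handling properties more concrete, here is a small Python sketch. It only approximates the described behavior: the function names are hypothetical, the 1-based column index and the trimming of leftover spaces are assumptions, and the actual connector may treat edge cases differently.

```python
import re


def sql_compatible(name):
    """Drop every character that is not a letter, a digit, or a space."""
    return re.sub(r"[^A-Za-z0-9 ]", "", name).strip()


def resolve_columns(raw_names, noname="column", sql_names=True):
    """Apply 'Column names compatible with SQL' and 'Noname column'."""
    resolved = []
    # Assumption: generated names use a 1-based column index.
    for index, name in enumerate(raw_names, start=1):
        name = (name or "").strip()
        if sql_names:
            name = sql_compatible(name)
        if not name:
            name = f"{noname}{index}"
        resolved.append(name)
    return resolved


def normalize_value(value, treat_null_as_null=True):
    """Apply "Treat 'null' as null" to a single cell value."""
    if treat_null_as_null and value == "null":
        return None
    return value
```

For example, resolve_columns(["Order #", "", "Total ($)"]) yields Order, column2, and Total in this sketch, and normalize_value("null") yields a true null.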

Table Selection and Prioritization

When extracting tables from an HTML document, the connector may detect multiple tables with the same or different structures. To provide greater flexibility in selecting the most relevant table for processing, we have introduced Table Selection Criteria and Main Table Columns settings.

Why This Is Needed

By default, the connector extracts and returns all detected tables and creates a nested dataset. However, in cases where users need to work with a specific table, they can now configure selection rules to automatically identify the most relevant table based on its structure or predefined column names.

Table Selection Criteria

This setting allows users to define how the connector selects the main table when multiple tables are found. The available options are:

All Tables (Default) – No specific table is selected; all tables are returned in a nested dataset.
Most Columns – The table with the highest number of columns is selected, prioritizing data-rich tables.
Most Rows – The table with the highest number of rows is selected, prioritizing tables with the most data records.

Main Table Columns

This setting allows users to specify a comma-separated list of column names that should be present in the main table. If a table contains all specified columns, it is automatically selected, regardless of the Table Selection Criteria setting. If no table matches, the system falls back to the Table Selection Criteria.

Matching Logic:

The specified columns can appear in any order within the table.
Case-insensitive matching is used (e.g., ColumnName and columnname are considered the same).
Any number of spaces between column names is ignored.
If a table has additional columns, it is still considered a match as long as all specified columns are present.

How It Works:

If Main Table Columns is set and a table containing all of these columns exists, that table is selected, overriding the Table Selection Criteria setting.
If Main Table Columns is not set, or no table matches, the connector follows the Table Selection Criteria setting.
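The selection rules can be summarized in a short Python sketch. This is an illustration under stated assumptions rather than the connector's implementation: the table representation (a dict with columns and rows), the function name select_main_table, the criteria values "all", "most_columns", and "most_rows", and the choice of the first matching table when several match are all hypothetical.

```python
def select_main_table(tables, criteria="all", main_columns=""):
    """Pick the main table, or return all tables when nothing is selected.

    tables: list of dicts shaped like {"columns": [...], "rows": [...]}.
    main_columns: comma-separated column names, as in Main Table Columns.
    """
    if not tables:
        return tables

    # Case-insensitive names; spaces around the commas are ignored.
    wanted = {c.strip().lower() for c in main_columns.split(",") if c.strip()}
    if wanted:
        for table in tables:
            present = {c.strip().lower() for c in table["columns"]}
            # Extra columns are fine as long as all wanted ones are present.
            if wanted <= present:
                return table

    # No Main Table Columns match: fall back to Table Selection Criteria.
    if criteria == "most_columns":
        return max(tables, key=lambda t: len(t["columns"]))
    if criteria == "most_rows":
        return max(tables, key=lambda t: len(t["rows"]))
    return tables  # All Tables: nothing is singled out
```

With criteria set to "most_rows" and main_columns set to "total ,  order id", a table whose columns include Order ID and Total is returned even if another table has more rows.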
