PDF Format

When to use this Format

Use the PDF Format when configuring a source-to-destination transformation that reads or writes PDF files.

How it works

When the PDF connector reads a document in PDF format, it attempts to identify and parse tables within the document. If there is only one unique table, the connector will generate a single, flat dataset. However, if multiple unique tables are detected, the connector will create a nested document, where each table becomes a column in a dataset with a single row. This structure allows for easier handling of multiple tables within a single document.
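
The two output shapes described above can be pictured with a minimal Python sketch. It is only an illustration, not the connector's implementation: the function name `build_dataset` and the nested column names `table1`, `table2`, and so on are hypothetical, and a table is assumed to be a list of row dictionaries.

```python
from typing import Any

Row = dict[str, Any]
Table = list[Row]


def build_dataset(tables: list[Table]) -> list[Row]:
    """Return a flat dataset for one table, or a single nested row for several.

    The column names table1, table2, ... are hypothetical placeholders.
    """
    if not tables:
        return []
    if len(tables) == 1:
        # One table: its rows become the dataset as-is.
        return tables[0]
    # Several tables: one row, where each column holds a whole table.
    return [{f"table{i}": table for i, table in enumerate(tables, start=1)}]


invoices = [{"id": 1, "total": 10.5}, {"id": 2, "total": 7.0}]
customers = [{"name": "Ann"}]

flat = build_dataset([invoices])
nested = build_dataset([invoices, customers])
```

Here `flat` holds the two invoice rows directly, while `nested` holds exactly one row whose `table1` and `table2` columns each contain a complete table.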

The connector offers flexibility by using different algorithms to identify and parse tables, ensuring table extraction even from complex documents. Additionally, if a table spans multiple pages in a PDF, the connector will attempt to process it as a single continuous table, merging content across pages.
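
As a rough illustration of merging a table that continues across pages, the sketch below joins consecutive page fragments that share the same header. This is an assumption about how such merging could work, not a description of the connector's actual algorithm; `merge_page_fragments` and the `(header, rows)` fragment layout are hypothetical.

```python
from typing import Any

Header = tuple[str, ...]
Fragment = tuple[Header, list[list[Any]]]


def merge_page_fragments(fragments: list[Fragment]) -> list[Fragment]:
    """Merge consecutive fragments (in page order) that have identical headers."""
    merged: list[Fragment] = []
    for header, rows in fragments:
        if merged and merged[-1][0] == header:
            # Same header as the previous fragment: treat it as a continuation.
            merged[-1][1].extend(rows)
        else:
            # A different header starts a new table; copy rows to avoid aliasing.
            merged.append((header, list(rows)))
    return merged


pages = [
    (("id", "name"), [[1, "Ann"], [2, "Bob"]]),  # page 1
    (("id", "name"), [[3, "Cy"]]),               # page 2, same table continues
    (("sku", "qty"), [["A-1", 4]]),              # page 2, a different table
]
tables = merge_page_fragments(pages)
```

With this input, `tables` contains two entries: the `id`/`name` table with three rows and the `sku`/`qty` table with one row.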

When writing data to a PDF file, the connector creates a structured, table-like representation of the data. The developer can specify the page format, which defaults to A4, to match the document’s layout requirements.

Create Format

To create a new PDF format, go to Connections, select the Formats tab, click Add Format, and type pdf in the Search field.

Parameters

Table Extraction Algorithm: Choose the table extraction algorithm based on the structure of the tables in your PDF:
- Spreadsheet Extraction: Use this option for PDFs with well-structured, grid-like tables that resemble spreadsheets. This algorithm works best when tables have clearly defined rows and columns with visible borders or gridlines.
- Basic Extraction: Select this option for PDFs with less structured tables or when tables are organized without clear borders. This algorithm is ideal for extracting tables based on spacing and alignment rather than strict gridlines.
Start row: If the “Start Row” value is specified, the system will begin reading the file from the given 0-based row index. All rows before the specified row will be ignored, including any header rows. This option allows for targeted extraction when you want to skip initial rows that are irrelevant or contain metadata.
Column names compatible with SQL: This feature converts column names to be SQL-compliant by removing any characters that are not alphanumeric or spaces. The resulting column names are formatted to ensure compatibility with SQL databases, making them suitable for use in queries and table definitions.
Noname column: Specifies the default column name when an actual name is not present in the document, such as when the file lacks a header row. The generated column name follows the pattern: noname.column.name + column index. Example: column1, column2 or field1, field2.
Skip Empty Rows: PDF files may occasionally contain completely empty rows without any values or delimiters. Enabling this option allows Etlworks to skip such rows during processing. If disabled, encountering an empty row will result in an exception when reading the file.
Treat 'null' as null: When this option is enabled, Etlworks will interpret string values that are literally "null" as actual null values, meaning no value is present. This helps differentiate between the string "null" and a true null in your data.
All fields are strings: By default, the connector determines column data types by sampling the data. Enable this setting to force all columns to be treated as strings (VARCHAR).
Use First Row for Data: If selected, the system assumes that the file does not contain a header row for field names. The first row of the file will be treated as data rather than column headers.
Date and Time Format: Specifies the format for timestamps that include both date and time values. This format will be used to parse and format timestamp fields throughout the system.
Date: Specifies the format for timestamps that include date values only. This format will be used to parse and format date fields throughout the system.
Time: Specifies the format for timestamps that include time values only. This format will be used to parse and format time fields throughout the system.
Parse Dates: If a Date or Time value is not recognized using the default formats, the system will attempt to parse it automatically and determine the appropriate format.
Page Size: Specifies one of the predefined page sizes to use when writing PDF files. The default page size is A4, but you can select other standard sizes as needed.
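
To make the interaction of several read-side parameters concrete, here is a minimal Python sketch of Start row, Skip Empty Rows, Use First Row for Data, Noname column, Column names compatible with SQL, and Treat 'null' as null. It is not Etlworks code: the function names, the keyword arguments, and the `column` prefix are hypothetical, the SQL-compatibility rule follows one reading of the description (drop every character that is neither alphanumeric nor a space), and the match on "null" is assumed to be exact.

```python
import re


def sql_compatible(name: str) -> str:
    # Assumed rule: keep only ASCII letters, digits, and spaces.
    return re.sub(r"[^0-9A-Za-z ]", "", name)


def read_rows(raw_rows, start_row=0, first_row_is_data=False, skip_empty=True,
              null_as_none=True, sql_names=False, noname_prefix="column"):
    """Turn extracted table rows (lists of strings) into row dictionaries."""
    rows = []
    for index, row in enumerate(raw_rows):
        if index < start_row:
            continue  # Start row: everything before the 0-based index is ignored
        if not any(cell not in (None, "") for cell in row):
            if skip_empty:
                continue  # Skip Empty Rows enabled
            raise ValueError(f"empty row at index {index}")
        rows.append(list(row))
    if not rows:
        return []

    if first_row_is_data:
        names, data = [""] * len(rows[0]), rows  # no header row in the file
    else:
        names, data = rows[0], rows[1:]

    header = []
    for position, name in enumerate(names, start=1):
        name = name or ""
        if sql_names:
            name = sql_compatible(name)
        # Noname column: fall back to prefix + 1-based column index.
        header.append(name or f"{noname_prefix}{position}")

    def convert(cell):
        return None if null_as_none and cell == "null" else cell

    return [{h: convert(c) for h, c in zip(header, cells)} for cells in data]


raw = [["Report 2024"], [], ["id", "na-me", ""], ["1", "null", "x"]]
result = read_rows(raw, start_row=1, sql_names=True)
```

In this example the title row is skipped by `start_row=1`, the empty row is dropped, `na-me` becomes `name`, the unnamed third column becomes `column3`, and the literal string "null" is converted to a real null (`None`).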
