Google BigQuery is a serverless, highly scalable, and cost-effective data warehouse designed for business agility.
You can use the same techniques you would normally use to work with relational databases in Etlworks. However, it is important to understand that inserting data into Google BigQuery row by row can be painfully slow.
It is recommended that you use Flows optimized to ETL data into BigQuery.
Use this Flow when you need to bulk-load files that already exist in Google Cloud Storage without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination.
To configure the final destination, click the Connections tab, and select the available BigQuery Connection.
Step 5. Set the source and the destination
Depending on the Flow type, you can select one of the following sources (FROM) for the Google BigQuery Flow:
Database: use the database table or view name as the source (FROM) name. Wildcard names, for example, public.*, are supported.
Well-known API: use the endpoint name or the file name as the source (FROM) name.
Web Service: use the endpoint name as the source (FROM) name.
File: use the source file name or a wildcard file name as the source (FROM) name.
Queue: use the queue name as the source (FROM) name.
For all BigQuery Flows, the destination Connection will be a Google Cloud Storage Connection. Etlworks uses the destination Connection as a stage.
Select or enter the fully qualified or wildcard BigQuery table name as a destination (TO). If the source (FROM) is entered using a wildcard template, set the destination (TO) to a wildcard as well.
Step 6. Set the data exchange Format configured in Step 2
Step 7. Set the optional parameters
Click MAPPING in the transformation row and select the Parameters tab.
Optional parameters and transformations:
Column List: you can specify a comma-separated list of columns in the external table, in the format column_name1 column_schema1,column_name2 column_schema2. The columns must be in the same order as in the file.
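For example, for a file with three columns, the Column List could be (hypothetical column names): id INT64,first_name STRING,updated_at TIMESTAMP.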
Decimal Target Types: determines how to convert a Decimal type. Equivalent to ExternalDataConfiguration.decimal_target_types. Example: ["NUMERIC", "BIGNUMERIC"].
String that represents NULL values in a CSV file: any string that represents SQL NULL.
Ignore Unknown Values: if true, ignore extra values that are not represented in the table schema without returning an error.
Max # of Errors to Ignore: the maximum number of bad records to ignore when reading the data.
Action: can be any of the following:
INSERT: inserts records into the BigQuery table.
MERGE: merges records from the source with the records in the BigQuery table.
CDC MERGE: merges records from the CDC-enabled source with the records in the BigQuery table. MERGE and CDC MERGE require configuring Lookup Fields.
Lookup Fields: the comma-separated list of fields that uniquely identify the record in the BigQuery table. This field is required if action is set to MERGE or CDC MERGE.
MERGE SQL: this is a user-defined SQL that will be used instead of the default when action is set to MERGE. If nothing is entered in this field, the default MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
{TABLE}: the table to MERGE data into.
{TEMP_TABLE}: the table to MERGE data from.
{KEY_FIELDS}: the fields uniquely identifying the record in both tables.
{FIELDS}: the fields to INSERT/UPDATE in the table to MERGE data into.
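For illustration, a user-defined MERGE SQL built from these tokens could look like the sketch below; the key column customer_id and the data columns first_name and updated_at are hypothetical, and the default SQL generated by the Flow may differ:

MERGE {TABLE} t
USING {TEMP_TABLE} s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET first_name = s.first_name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, first_name, updated_at)
  VALUES (s.customer_id, s.first_name, s.updated_at)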
CDC MERGE SQL: this is a user-defined SQL that will be used instead of the default when action is set to CDC MERGE. If nothing is entered in this field, the default CDC MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
{TABLE}: the table to MERGE data into.
{TEMP_TABLE}: the table to MERGE data from.
{KEY_FIELDS}: the fields uniquely identifying the record in both tables.
{FIELDS}: the fields to INSERT.
{UPDATE_FIELDS}: the fields and values to update, in the format field=value,field=value.
{UPDATE_CONDITIONS}: the WHERE clause to update existing records.
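For illustration only, a user-defined CDC MERGE SQL could follow the pattern below; the key column customer_id, the data columns, and the cdc_op column carrying the CDC operation ('d' for delete) are all hypothetical, and the default SQL generated by the Flow may differ:

MERGE {TABLE} t
USING {TEMP_TABLE} s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.cdc_op = 'd' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET first_name = s.first_name, updated_at = s.updated_at
WHEN NOT MATCHED AND s.cdc_op != 'd' THEN
  INSERT (customer_id, first_name, updated_at)
  VALUES (s.customer_id, s.first_name, s.updated_at)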
Purge File if Success: if this option is enabled (by default) and the user-defined SQL is not used, Etlworks will run the COPY INTO command with the purge option enabled. With purge enabled, the staging file will be automatically deleted after the COPY INTO command is successfully executed.
Purge File(s) if Error: if this option is enabled (by default), the staging files will be automatically deleted if a COPY INTO command was executed with an error.
ELT
Before LOAD SQL: this SQL will be executed on the BigQuery Connection before loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing Before LOAD SQL: if this option is enabled, and there is an error when Before LOAD SQL is executed, the error will be ignored.
Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
After LOAD SQL: this SQL will be executed on the BigQuery Connection after loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing After LOAD SQL: if this option is enabled, and there is an error when After LOAD SQL is executed, the error will be ignored.
After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
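For example, assuming a hypothetical analytics dataset (and, for the multi-statement After LOAD SQL, that the 'is a script' option is enabled), the fields could be populated like this:

Before LOAD SQL:
  CREATE SCHEMA IF NOT EXISTS analytics

After LOAD SQL:
  INSERT INTO analytics.load_audit (table_name, loaded_at) VALUES ('customers', CURRENT_TIMESTAMP());
  DELETE FROM analytics.customers WHERE customer_id IS NULL;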
Debugging
Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including before/after LOAD SQL.
Parameters specific for different source types
Depending on the Flow type, other Flow parameters can be added, as explained below:
Ignore Transformations: if this parameter is enabled, the system will copy files directly into Google Cloud Storage before creating the external table and loading data into the actual table. This greatly improves the performance of the load but automatically disables any transformations.
If this parameter is enabled, the system will automatically delete loaded files from the source.
If this parameter is enabled and there is an error during the load, the system will automatically delete loaded files from the source.
To MERGE (UPSERT) existing data in the BigQuery table with new data:
Set the Action to MERGE.
Define the Lookup Fields, the comma-separated list of fields that uniquely identify the record in the Google BigQuery table.
Alternatively, you can enable the Predict Lookup Fields option, which forces the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to detect the unique fields correctly.
If the Predict Lookup Fields option (which is not always accurate) is not suitable, you can specify a list of table=fields pairs in the Lookup Fields. Use fully qualified table names and ; as a separator between table=fields pairs.
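For example (hypothetical table and column names): public.customers=customer_id;public.orders=order_id,order_line_id.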
Step 9. Configure how to handle source and destination schema changes
It is typical when the source (for example, the table in the OLTP database) and the destination (Google BigQuery table) have a different number of columns. It is also entirely possible that the order of columns in the source is different from that in the destination.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
1. Reorder CSV Columns During Extract: if this option is enabled (it is disabled by default), the Flow reorders columns in the data file to match the order of columns in the target table. This option is ignored if the target table does not exist.
2. Ignore fields in the source that do not exist in the target table: if this option is enabled (it is disabled by default) and the source has more fields than the target table, the system will automatically exclude extra fields. Options 2, 3, and 4 are mutually exclusive.
3. Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will add extra fields to the target table. Options 2, 3, and 4 are mutually exclusive.
4. Recreate target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will automatically drop and recreate the target table. Options 2, 3, and 4 are mutually exclusive.
5. Insert null into the fields in the target table that do not exist in the source: if this option is enabled (it is disabled by default) and the target table has different fields than the source, the system will automatically insert NULL values into these fields in the target.
6. Create table SQL: by default, the system automatically generates the SQL for creating a new table if it does not exist. You can override the default SQL by providing a template or JavaScript code. Read more about how to override the create table SQL.
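For reference, the statement executed to create a missing table is regular BigQuery DDL; conceptually, it resembles the sketch below, where the dataset, table, and column names are hypothetical and the SQL actually generated by the system may differ:

CREATE TABLE IF NOT EXISTS analytics.customers (
  customer_id INT64,
  first_name STRING,
  updated_at TIMESTAMP
)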
Configure notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure the Flow to send a notification when either source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
The only difference is that you don't need to configure the exception handler. If the system detects that the source and the destination have a different number of columns, it will add an exception to the log that can be retrieved and sent in the email body as explained above.
Step 10. Optionally configure Mapping
If necessary, you can create a Mapping between the source and destination (Google BigQuery) fields.
You create the Mapping between the source and destination just like you usually do for any other Flow type.
Configure change replication
When change replication is enabled, only the changed records will be loaded into BigQuery.
Basically, all you need to do is set the High Watermark Field and Enable Change Replication for the transformation. Additionally, for better reliability, you can set the calculated High Watermark Field Value.
Load multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations), follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully qualified source database object name with wildcard characters (* and ?), for example, public.*.
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude Objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include Objects field.
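For example, to skip two (hypothetical) housekeeping tables, enter public.schema_version,public.audit_log in the Exclude Objects field.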
TO
Set the TO to db.schema.*, for example, dbo.history.*.
Optionally, configure the Source query. You can use token {TABLE} in the Source query.
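For example, a Source query such as SELECT * FROM {TABLE} WHERE updated_at >= '2024-01-01' extracts only recent rows from each matching table; the updated_at column and the cutoff date are hypothetical, and the column must exist in every table selected by the wildcard.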
BEFORE/AFTER SQL
If you want to execute any code before and/or after LOAD, use the token {TABLE} in the Before LOAD SQL and After LOAD SQL.
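For example, an After LOAD SQL such as DELETE FROM {TABLE} WHERE customer_id IS NULL (the customer_id column is hypothetical) will be executed with {TABLE} replaced by the name of the table being loaded.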
MERGE
If you are configuring the MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
An alternative to enabling the Predict Lookup Fields option (which is not always accurate) is to specify the list of table=fields pairs in the Lookup Fields. Use fully qualified table names and ; as a separator between table=fields pairs.
If you are setting up the high watermark change replication with a calculated HWM field, use the token {TABLE} in the High Watermark Field Value.
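For example, a calculated value such as SELECT MAX(updated_at) FROM {TABLE} could be used, assuming the value is computed by a query and the (hypothetical) updated_at column exists in every table matched by the wildcard.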
Flow variables
To add user-defined Flow variables, open the Parameters tab and add variables as key-value pairs. Use uppercase characters and no spaces for variable names.
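For example (hypothetical names and values): TARGET_DATASET=analytics, LOAD_REGION=us-east1.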
Maximum number of parallel threads
It is possible to execute multiple source-to-destination transformations (extract-load) in parallel threads.
By default, Etlworks limits the number of threads to 5. To change the limit, open the Parameters tab and set the maximum number of threads in Maximum Number of Parallel Threads. This value cannot be greater than 99.