Use this flow to extract data from any supported source, transform it, and load it into Google BigQuery. Etlworks stages CSV files in Google Cloud Storage, creates an external table over them, then loads the rows into the destination BigQuery table.
Which BigQuery flow should I use?
| Flow | Use when |
|---|---|
|
Any to BigQuery (this article) Database / File / Queue / Web service / Well-known API → BigQuery |
You need to extract from any source, optionally transform, and load into BigQuery. |
| Bulk load files in Google Cloud Storage into BigQuery | The files already exist in GCS. No transformation needed. Auto-generates the load command; supports MERGE. |
| Stream CDC events into BigQuery | You need real-time replication from a CDC-enabled source database. |
| Streaming with message queues | You need real-time ingestion from a message queue that supports streaming. |
Google BigQuery is a serverless data warehouse. Row-by-row inserts are very slow; the flows above use staged CSV + external tables + SQL load, which is the supported high-performance load path.
How does this flow work?
- Extracts data from the source.
- Writes the data to CSV files in Google Cloud Storage.
- Creates a BigQuery external table over the staged files.
- Loads data from the external table into the actual BigQuery table using SQL.
- Cleans up staged files.
What do I need before I start?
- BigQuery is available from your Etlworks instance.
- A Google Cloud Storage bucket exists and the service account has read/write permissions on it.
Step-by-step setup
Step 1. Create connections
You need three connections:
- A source connection (database, file storage, web service, queue, or well-known API).
- A Google Cloud Storage connection for the stage.
- A BigQuery connection.
Step 2. Create a CSV format
Create a CSV format. Set the date and time formats to ones BigQuery understands: yyyy-MM-dd HH:mm:ss.SSS.
Step 3. Create the flow
Open Flows, click +, type bigquery in the gallery search, and pick the flow type that matches your source.
Add the source-to-destination transformation and continue with the steps below.
Step 4. Set the BigQuery connection on the flow
Open the Connections tab and pick the BigQuery connection from Step 1.
Step 5. Set source (FROM) and destination (TO)
FROM (varies by flow type):
- Database: table or view name. Wildcards supported, e.g. public.*.
- Well-known API: endpoint or file name.
- Web service: endpoint name.
- File: file name or wildcard pattern.
- Queue: queue name.
TO: the fully qualified BigQuery table name. If FROM uses a wildcard, use a matching wildcard in TO.
The GCS connection is the destination connection on the transformation. Etlworks uses it as the staging area.
Step 6. Set the CSV format
Select the format you created in Step 2 as the destination format.
Step 7. Set parameters (optional)
Click MAPPING on the transformation row, then go to the Parameters tab.
Load behavior
| Parameter | What it does |
|---|---|
| Column List | Comma-separated list of external-table column definitions in the form name1 schema1, name2 schema2. Columns must be in the same order as in the file. |
| Decimal Target Types | How Decimal values are converted. Equivalent to BigQuery's ExternalDataConfiguration.decimal_target_types. Example: ["NUMERIC", "BIGNUMERIC"]. |
| String that represents NULL values in a CSV file | The source string that maps to SQL NULL on load. |
| Ignore Unknown Values | If on, source fields that aren't in the table schema are dropped instead of failing the load. |
| Max # of Errors to Ignore | Maximum number of bad rows to skip before failing the load. |
| Action | INSERT inserts new rows; MERGE upserts on lookup fields; CDC MERGE applies CDC change events (insert / update / delete). |
| Lookup Fields | Comma-separated list of columns that uniquely identify a record. Required when Action is MERGE or CDC MERGE. |
| MERGE SQL | User-defined MERGE SQL. Tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}. |
| CDC MERGE SQL | User-defined CDC-MERGE SQL. Tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}, {UPDATE_FIELDS}, {UPDATE_CONDITIONS}. |
| Purge File if Success | On by default (when not using user-defined SQL). Deletes the staged file after a successful load. |
| Purge File(s) if Error | On by default. Deletes staged files on load failure. |
ELT — SQL before and after load
| Parameter | What it does |
|---|---|
| Before LOAD SQL / Ignore errors / is a script | SQL to run on the BigQuery connection before the load. Optional error suppression and full-script mode. |
| After LOAD SQL / Ignore errors / is a script | Same three options, but run after the load. |
Debugging
| Parameter | What it does |
|---|---|
| Log each executed SQL statement | Logs every auto-generated and executed SQL statement, including before-/after-LOAD SQL. |
Source-specific parameters
- Database source: see Extract, Transform, and Load (ETL) data for source-database options.
-
File or web-service source:
- Ignore Transformations: copies files directly to GCS without applying transformations. Much faster, but you lose any transformation steps.
- Delete loaded source files: deletes the source file after a successful load.
- Delete source files on error: deletes the source file if the load fails.
- Well-known API source: see the well-known API options.
Step 8. Configure MERGE (UPSERT) (optional)
- Set Action to MERGE.
- Provide Lookup Fields — a comma-separated list of columns that uniquely identify a record. Or enable Predict Lookup Fields for auto-detection (less reliable).
For wildcard sources (Step 9 below) that target multiple tables, specify per-table lookup fields with fully qualified names, separated by semicolons:
test1.inventory=inventory_id,database_name; test1.payment=payment_id,database_name; test1.rental=rental_id,database_name;
Step 9. Handle source / destination schema differences
| Option | What it does |
|---|---|
| Reorder CSV Columns During Extract | Reorders columns in the extracted file to match the target. Ignored if the target table doesn't exist. |
| Ignore Fields in Source Not in Target | Drops source columns the target doesn't have. Mutually exclusive with the next two. |
| Alter Target Table If Source Has Extra Columns | Adds the extra columns to the BigQuery table. |
| Recreate Target Table If Source Has Extra Columns | Drops and recreates the target table. Destructive. |
| Insert NULL Into Target Fields Not in Source | Inserts NULL for any target column the source doesn't have. |
| Create Table SQL | Custom SQL or JavaScript template for creating the target table. See overriding the create-table SQL. |
Notifications on schema mismatch
If the source and target columns don't match, the flow logs an exception. You can wire that exception into an email notification — see sending notifications about failed transformations.
Step 10. Configure mapping (optional)
If you need explicit field-to-field mapping, set it up in Mapping. Mapping is not required; auto-mapping by field name works for most flows.
Incremental change replication (high watermark)
Like any other ETL flow, this flow supports change replication using a high watermark (HWM). Only records that changed since the last run are loaded.
- Set the High Watermark Field — the column that tracks changes.
- Enable Change Replication on the transformation.
- Optional: set a calculated HWM value for clearer tracking.
Load multiple tables with a wildcard
FROM
- Create a single source-to-destination transformation.
- Set FROM to a fully qualified source object name with wildcards, e.g. public.*.
By default every table and view matching the pattern is included.
TO
Set TO to a BigQuery dataset with a wildcard, e.g. dataset.*.
Include / exclude objects
- Exclude Objects: comma-separated list of objects to skip. Use all views to skip every view; all tables to skip every table.
- Include Objects: comma-separated list of objects to keep.
Source query, before/after SQL
Use the {TABLE} token in Source Query, Before LOAD SQL, and After LOAD SQL.
MERGE with wildcards
- Leave Lookup Fields empty and enable Predict Lookup Fields, or
- Specify per-table lookup fields as semicolon-separated fully_qualified_table=field_list pairs.
HWM with wildcards
Use the {TABLE} token in the High Watermark Field Value.
Flow variables
Open the Parameters tab and add user-defined flow variables as key-value pairs. Use uppercase names without spaces.
Maximum number of parallel threads
Multiple source-to-destination transformations (extract-load) can run in parallel threads. The default limit is 5; the maximum is 99. Change it via Maximum Number of Parallel Threads on the Parameters tab.
Troubleshooting
See Common issues when loading data into cloud data warehouses for common errors and fixes.