When to use this Flow type
If you already have CSV files in Google Cloud Storage and don't need to transform the data, the most efficient way to load these files into Google BigQuery is to use the Flow type Bulk load files in Google Cloud Storage into BigQuery described in this article.
Use this Flow type:
- When you simply want to load CSV files that already exist in Google Cloud Storage into BigQuery, without applying any transformations or transferring files over the network.
- When you want to load CSV files created by the CDC Flow.
Flows optimized for BigQuery
Flow type | When to use
| When you need to extract data from any source, transform it, and load it into BigQuery.
Bulk load files in Google Cloud Storage into BigQuery (you are here) | When you need to bulk-load files that already exist in Google Cloud Storage without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination.
Stream CDC events into BigQuery | When you need to stream updates from a database that supports Change Data Capture (CDC) into BigQuery in real time.
Stream messages from a queue into BigQuery | When you need to stream messages from a message queue that supports streaming into BigQuery in real time.
How it works
- The Flow reads the names of all files matching the wildcard in the specific Google Cloud Storage bucket. It can traverse subfolders as well.
- The Flow calculates the destination table names based on the source file names.
- The Flow creates multiple threads: one per destination table, but not more than the configurable threshold.
- The Flow creates a temporary external table (one per actual table) over the data in cloud storage.
- The Flow merges the data in the temporary external tables with the data in the actual tables.
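Conceptually, each load is equivalent to creating an external table over the files and merging it into the destination. The following is only a hedged sketch of that idea in BigQuery SQL; the bucket, dataset, table, and column names (my-bucket, project123.test.sales, id, amount, updated_at) are hypothetical, and the actual statements are generated by the Flow:

```sql
-- Hypothetical sketch of one load thread; all object names are examples only.

-- 1. Expose the matching CSV files as a temporary external table.
CREATE OR REPLACE EXTERNAL TABLE project123.test.sales_ext (
  id INT64,
  amount NUMERIC,
  updated_at TIMESTAMP
)
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/sales_*.csv'],
  skip_leading_rows = 1
);

-- 2. Merge the external data into the actual destination table.
MERGE project123.test.sales AS t
USING project123.test.sales_ext AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at);
```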
Features
The Flow supports COPY (INSERT), MERGE (UPSERT), and CDC MERGE. With CDC MERGE, the Flow updates the actual destination table by applying INSERT/UPDATE/DELETE events in the same order in which they originated in the source database and were recorded in the CSV files. It does this in the most efficient way possible.
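As a rough illustration of what a CDC-style merge involves, the sketch below keeps only the latest event per key and applies it as an insert, update, or delete. It is an assumption-heavy example: the operation column op, the ordering column seq, and all table and column names are hypothetical, and the SQL actually generated by the Flow may differ:

```sql
-- Hypothetical CDC MERGE sketch; op (I/U/D), seq, and all names are illustrative only.
MERGE project123.test.sales AS t
USING (
  -- Keep the most recent CDC event per key, preserving source order via seq.
  SELECT * EXCEPT (rn)
  FROM (
    SELECT s.*, ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq DESC) AS rn
    FROM project123.test.sales_cdc_ext AS s
  )
  WHERE rn = 1
) AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED AND s.op != 'D' THEN
  INSERT (id, amount, updated_at) VALUES (s.id, s.amount, s.updated_at);
```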
Other Features | Description
Monitor source schema changes | The Flow is able to CREATE and ALTER destination tables to keep up with schema changes in the source.
Delete loaded source files | The Flow can be configured to delete successfully loaded source files.
Load data in parallel | The Flow is able to load data into multiple destination tables in parallel.
Process files in subfolders | The Flow is able to process files in multiple nested subfolders.
Prerequisites
- The Google BigQuery instance must be available from the Internet.
- The Google Cloud Storage account exists, and the user has read/write permissions for the specific bucket.
Process
Step 1. Create a new Google Storage Connection
This Connection will be used as a source to access files stored in the Google Cloud Storage bucket.
Step 2. Create a new Google BigQuery connection
This Connection will be used as a destination.
Step 3. Create a new CSV Format
This Format will be used to create an external table in BigQuery.
Create a CSV Format. Set the date and time formats to formats understood by BigQuery: yyyy-MM-dd HH:mm:ss.SSS.
Step 4. Create a new Flow
This Flow will be used to load files in the Google Cloud Storage bucket into BigQuery.
In Flows, click [+], type in bigquery, and select the Flow Bulk load files in Google Cloud Storage into BigQuery.
Step 5. Configure load transformation
Select or enter the following attributes of the transformation (left to right):
- Google Storage Connection created in step 1.
- CSV Format created in step 3.
- A wildcard filename that matches the filenames in cloud storage. Use the .csv extension for uncompressed files and .gz for compressed files.
- BigQuery Connection created in step 2.
- The wildcard destination table name in the following format: project.schema.prefix_*_suffix, where project.schema is the BigQuery project and schema to load data into. Example: project123.test.*. You don't need to use wildcards when loading into a specific table.
Step 6. Set optional and required parameters
Click MAPPING.
Select the Parameters tab.
Select or enter the following optional and required parameters:
Source Files and Destination Tables
- Include files in subfolders: if this option is enabled, the Flow will load all files that match the wildcard filename (set in FROM) in all subfolders under the main bucket or blob.
- Exclude files and Include files: optional comma-separated lists of files to exclude and/or include. You can use wildcard file names. These options are used together with the wildcard filename set in FROM:
  - The Flow populates the list of files matching the wildcard filename in FROM.
  - The Flow excludes files based on the list of exclusions set in Exclude files.
  - The Flow includes files based on the list of inclusions set in Include files.
- Calculate BigQuery table name: an optional parameter used to calculate (using JavaScript) the destination BigQuery table name based on the source file name. The original file name, without the extension, is stored in the variable name, and the actual table name must be assigned to the variable value. For example, if the source file name is db_schema-table_cdc_stream_uuid.csv, the JavaScript code to calculate the destination table name would be value = name.substring(name.indexOf('-') + 1, name.indexOf('_cdc_stream'));. IMPORTANT: the Flow will automatically recognize and transform filenames matching the template *_cdc_stream_*.*. If filenames match this template, there is no need to provide a transformation script.
- Maximum Number of Files to Process: an optional parameter that controls how many files will be loaded in one batch (one run of the Flow). If you expect that at any point in time there will be hundreds of thousands or millions (10^6-10^7) of files in the source folder(s), it is recommended to limit the number of files to process to something more manageable: hundreds to thousands (10^3-10^5).
- Maximum Number of Parallel Loads: if the value of this optional parameter is greater than 1, the Flow will create multiple threads to load data in parallel: one thread per table, but not more than the threshold set in this parameter.
- Parallel: enable this flag if you have more than one source-to-destination transformation that you want to execute in parallel.
- Purge File if Success: if this option is enabled (it is by default), the source files will be automatically deleted after the COPY command is successfully executed.
Debug
- Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including the Before/After Load SQL.
Load Parameters
- Decimal Target Types: determines how to convert a Decimal type. Equivalent to ExternalDataConfiguration.decimal_target_types. Example: ["NUMERIC", "BIGNUMERIC"].
- String that represents NULL values in a CSV file: any string that represents SQL NULL.
- Ignore Unknown Values: if true, extra values that are not represented in the table schema are ignored without returning an error.
- Max # of Errors to Ignore: the maximum number of bad records to ignore when reading the data.
Load and Merge
- Action: can be INSERT, which inserts records from the file(s) into the BigQuery table; MERGE, which merges records from the file with the records in the BigQuery table; or CDC MERGE, which loads CDC events (INSERT/UPDATE/DELETE) in the same order as they originated in the source database. MERGE and CDC MERGE require configuring the Lookup Fields and/or enabling the option to predict the lookup fields.
- Lookup Fields: the comma-separated list of fields that uniquely identify the record in the target BigQuery table. If the option to predict the lookup fields is enabled and Lookup Fields is empty, the system will try to predict the columns that uniquely identify the record in the target BigQuery table.
- User-defined MERGE SQL: by default, the MERGE SQL is generated automatically. Use this parameter to override the automatically generated SQL (see the example after this list). The following {tokens} are available:
  - {TABLE} - the destination table.
  - {TEMP_TABLE} - the staging or temp table.
  - {FIELDS} - the fields to INSERT/UPDATE in the destination table.
  - {INSERT_FIELDS} - the values of the fields to INSERT into the destination table.
  - {UPDATE_FIELDS} - the fields and values to UPDATE in the format field=value,field=value.
  - {UPDATE_CONDITIONS} - the condition to match records in the actual destination and temp/staging tables.
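For illustration only, a user-defined MERGE SQL assembled from these tokens might look like the sketch below; it is not the exact SQL the Flow generates automatically, and the exact form of each expanded token depends on how the Flow resolves it:

```sql
-- Hypothetical user-defined MERGE SQL template built from the available {tokens}.
MERGE {TABLE}
USING {TEMP_TABLE}
ON {UPDATE_CONDITIONS}
WHEN MATCHED THEN
  UPDATE SET {UPDATE_FIELDS}
WHEN NOT MATCHED THEN
  INSERT ({FIELDS}) VALUES ({INSERT_FIELDS});
```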
ELT
- Set variables for Before and After SQL: JavaScript code that sets variables which can be referenced as {varname} in the Before and/or After SQL. To set a variable, add it as a key-value pair to the script variable vars (see also the sketch after this list).
  - Example: vars.put('schema', 'abc');
  - Example of the SQL: insert into test (schema_name) values ('{schema}')
- Before Load SQL: this SQL will be executed on the BigQuery Connection before running the COPY SQL.
- Ignore errors when executing Before Load SQL: if this option is enabled and there is an error when the Before Load SQL is executed, the error will be ignored.
- Before Load SQL is a script: if this option is enabled, the Before Load SQL will be executed as a BigQuery script.
- After Load SQL: this SQL will be executed on the BigQuery Connection after running the COPY SQL.
- Ignore errors when executing After Load SQL: if this option is enabled and there is an error when the After Load SQL is executed, the error will be ignored.
- After Load SQL is a script: if this option is enabled, the After Load SQL will be executed as a BigQuery script.
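To make the variable mechanics concrete, here is a hedged sketch of a Before Load SQL statement that references the {schema} variable set in the example above; the audit table project123.test.audit is hypothetical:

```sql
-- Hypothetical Before Load SQL; {schema} is replaced with the value set via vars.put('schema', 'abc').
INSERT INTO project123.test.audit (schema_name, loaded_at)
VALUES ('{schema}', CURRENT_TIMESTAMP());
```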
Handling source schema changes
- Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has fields that the target (BigQuery) table does not, the system will add the extra fields to the target table (see the sketch after this list).
- Create all tables with only TEXT columns: if this option is enabled (it is disabled by default), the Flow will create all tables with only TEXT columns. Enable this option to make the load more resilient to source schema drift. By default, the Flow creates the actual table using the first file as a pattern. It then alters the actual table by adding new columns, but it never modifies the data types of the existing columns, which can cause issues if there are multiple source files to load into the same destination table and some of these files have columns with different data types (compared to each other).
- Tables with all TEXT columns: a comma-separated list of regular expressions matching the table names for which the Flow should create all columns as TEXT. Here are the rules for the regexes:
  - Regexes are case-insensitive.
  - If a regex includes a comma (,), it must be escaped (/,).
  - Java regex syntax must be used.
  - If the destination table name is configured using fully qualified (DB.SCHEMA.TABLE) or partially qualified (SCHEMA.TABLE) notation, the regex should reflect it. Example (include tables country and city): text.* -> .*country,.*city.
- TEXT columns: a comma-separated list of regular expressions matching the column names which should be created as TEXT.

Here are the rules for the 3 parameters above:
- If Create all tables with only TEXT columns is enabled, all columns in all new tables and all new columns in the existing tables will be created as TEXT.
- If the table name matches a regex in Tables with all TEXT columns and TEXT columns is empty, all columns in that table will be created as TEXT.
- If TEXT columns is not empty and the column name in any table matches any regex configured in this field, the column will be created as TEXT.
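For context, the schema-drift options correspond to DDL along the lines of the sketch below. This is an assumption: the table and column names are hypothetical, TEXT is presumed to map to BigQuery STRING, and the Flow generates its own statements:

```sql
-- Hypothetical DDL illustrating the schema-drift options; names are examples only.

-- Added when a source file has a column the destination table lacks:
ALTER TABLE project123.test.sales ADD COLUMN discount_code STRING;

-- With the TEXT-only options enabled, every column would presumably be created as STRING:
CREATE TABLE IF NOT EXISTS project123.test.sales_raw (
  id STRING,
  amount STRING,
  updated_at STRING
);
```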
Step 7. Optionally, add a mapping.
The mapping can be used to:
- Rename the column.
- Exclude the column.
- Change the column data type for tables automatically created by the flow.
Configure mapping
Select the transformation and click MAPPING.
Add a mapping for the column CURRENT_DATE.
Renaming or excluding a column in the mapping will work for all tables that have this column. If a table does not have the column, the mapping will be ignored.
Step 8. Optionally, add more source-to-destination transformations
Click [+] and repeat steps 5 to 6 for more locations and/or file name templates.