Google BigQuery is a serverless, highly scalable, and cost-effective data warehouse designed for business agility.
You can use the same techniques you would normally use to work with relational databases in Etlworks. However, it is important to understand that inserting data into Google BigQuery row by row can be painfully slow.
It is recommended that you use Flows optimized to ETL data into BigQuery.
Use this Flow when you need to bulk-load files that already exist in Google Cloud Storage without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination.
To configure the final destination, click the Connections tab, and select the available BigQuery Connection.
Step 5. Set the source and the destination
Depending on the Flow type, you can select one of the following sources (FROM) for the Google BigQuery Flow:
Database: use the database table or view name as the source (FROM) name. Wildcard names, for example, public.*, are supported.
Well-known API: use the endpoint name or the file name as the source (FROM) name.
Web Service: use the endpoint name as the source (FROM) name.
File: use the source file name or a wildcard file name as the source (FROM) name.
Queue: use the queue name as the source (FROM) name.
For all BigQuery Flows, the destination Connection will be a Google Cloud Storage Connection. Etlworks uses the destination Connection as a stage.
Select or enter the fully qualified or wildcard BigQuery table name as a destination (TO). If the source (FROM) is entered using a wildcard template, set the destination (TO) to a wildcard as well.
Step 6. Set the data exchange Format configured in Step 2
Step 7. Set the optional parameters
Click MAPPING in the transformation row and select the Parameters tab.
Optional parameters and transformations:
Column List: you can specify a comma-separated list of columns in the external table, in the format column_name1 column_schema1,column_name2 column_schema2. The columns must be in the same order as in the file.
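For example, for a file with three columns, the Column List could be (hypothetical column names): id INT64,first_name STRING,updated_at TIMESTAMP.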
Decimal Target Types: determines how to convert a Decimal type. Equivalent to ExternalDataConfiguration.decimal_target_types. Example: ["NUMERIC", "BIGNUMERIC"].
String that represents NULL values in a CSV file: any string that represents SQL NULL.
Ignore Unknown Values: if true, ignore extra values that are not represented in the table schema without returning an error.
Max # of Errors to Ignore: the maximum number of bad records to ignore when reading the data.
Action: can be any of the following:
INSERT: inserts records into the BigQuery table.
MERGE: merges records from the source with the records in the BigQuery table.
CDC MERGE: merges records from the CDC-enabled source with the records in the BigQuery table. MERGE and CDC MERGE require configuring Lookup Fields.
Lookup Fields: the comma-separated list of fields that uniquely identify the record in the BigQuery table. This field is required if action is set to MERGE or CDC MERGE.
MERGE SQL: this is a user-defined SQL that will be used instead of the default when action is set to MERGE. If nothing is entered in this field, the default MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
{TABLE}: the table to MERGE data into.
{TEMP_TABLE}: the table to MERGE data from.
{KEY_FIELDS}: the fields uniquely identifying the record in both tables.
{FIELDS}: the fields to INSERT/UPDATE in the table to MERGE data into.
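For illustration, a user-defined MERGE SQL built from these tokens could look like the sketch below; the key column customer_id and the data columns first_name and updated_at are hypothetical, and the default SQL generated by the Flow may differ:

MERGE {TABLE} t
USING {TEMP_TABLE} s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
  UPDATE SET first_name = s.first_name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, first_name, updated_at)
  VALUES (s.customer_id, s.first_name, s.updated_at)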
CDC MERGE SQL: this is a user-defined SQL that will be used instead of the default when action is set to CDC MERGE. If nothing is entered in this field, the default CDC MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
{TABLE}: the table to MERGE data into.
{TEMP_TABLE}: the table to MERGE data from.
{KEY_FIELDS}: the fields uniquely identifying the record in both tables.
{FIELDS}: the fields to INSERT.
{UPDATE_FIELDS}: the fields and values to update, in the format field=value,field=value.
{UPDATE_CONDITIONS}: the WHERE clause to update existing records.
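For illustration only, a user-defined CDC MERGE SQL could follow the pattern below; the key column customer_id, the data columns, and the cdc_op column carrying the CDC operation ('d' for delete) are all hypothetical, and the default SQL generated by the Flow may differ:

MERGE {TABLE} t
USING {TEMP_TABLE} s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.cdc_op = 'd' THEN
  DELETE
WHEN MATCHED THEN
  UPDATE SET first_name = s.first_name, updated_at = s.updated_at
WHEN NOT MATCHED AND s.cdc_op != 'd' THEN
  INSERT (customer_id, first_name, updated_at)
  VALUES (s.customer_id, s.first_name, s.updated_at)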
Purge File if Success: if this option is enabled (by default) and the user-defined SQL is not used, Etlworks will run the COPY INTO command with the purge option enabled. With purge enabled, the staging file will be automatically deleted after the COPY INTO command is successfully executed.
Purge File(s) if Error: if this option is enabled (by default), the staging files will be automatically deleted if a COPY INTO command was executed with an error.
ELT
Before LOAD SQL: this SQL will be executed on the BigQuery Connection before loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing Before LOAD SQL: if this option is enabled, and there is an error when Before LOAD SQL is executed, the error will be ignored.
Before LOAD SQL is a script: if this option is enabled, Before LOAD SQL will be executed as a script.
After LOAD SQL: this SQL will be executed on the BigQuery Connection after loading data. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing After LOAD SQL: if this option is enabled, and there is an error when After LOAD SQL is executed, the error will be ignored.
After LOAD SQL is a script: if this option is enabled, After LOAD SQL will be executed as a script.
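For example, assuming a hypothetical analytics dataset (and, for the multi-statement After LOAD SQL, that the 'is a script' option is enabled), the fields could be populated like this:

Before LOAD SQL:
  CREATE SCHEMA IF NOT EXISTS analytics

After LOAD SQL:
  INSERT INTO analytics.load_audit (table_name, loaded_at) VALUES ('customers', CURRENT_TIMESTAMP());
  DELETE FROM analytics.customers WHERE customer_id IS NULL;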
Debugging
Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including before/after LOAD SQL.
Parameters specific for different source types
Depending on the Flow type, other Flow parameters can be added, as explained below:
Ignore Transformations: if this parameter is enabled, the system will copy files directly into Google Cloud Storage before creating the external table and loading data into the actual table. This greatly improves the performance of the load but automatically disables any transformations.
If this parameter is enabled, the system will automatically delete loaded files from the source.
If this parameter is enabled and there is an error during the load, the system will automatically delete loaded files from the source.
To MERGE (UPSERT) existing data in the BigQuery table with new data:
Set the Action to MERGE.
Define the Lookup Fields, the comma-separated list of fields that uniquely identify the record in the Google BigQuery table.
Alternatively, you can enable the Predict Lookup Fields option, which forces the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to detect the unique fields correctly.
If the Predict Lookup Fields option (which is not always accurate) is not suitable, you can specify a list of table=fields pairs in the Lookup Fields. Use fully qualified table names and ; as a separator between table=fields pairs.
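For example (hypothetical table and column names): public.customers=customer_id;public.orders=order_id,order_line_id.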
Step 9. Configure how to handle source and destination schema changes
It is typical when the source (for example, the table in the OLTP database) and the destination (Google BigQuery table) have a different number of columns. It is also entirely possible that the order of columns in the source is different from that in the destination.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
1. Reorder CSV Columns During Extract: if this option is enabled (it is disabled by default), the Flow reorders columns in the data file to match the order of columns in the target table. This option is ignored if the target table does not exist.
2. Ignore fields in the source that do not exist in the target table: if this option is enabled (it is disabled by default) and the source has more fields than the target table, the system will automatically exclude extra fields. Options 2, 3, and 4 are mutually exclusive.
3. Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will add extra fields to the target table. Options 2, 3, and 4 are mutually exclusive.
4. Recreate target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target table, the system will automatically drop and recreate the target table. Options 2, 3, and 4 are mutually exclusive.
5. Insert null into the fields in the target table that do not exist in the source: if this option is enabled (it is disabled by default) and the target table has different fields than the source, the system will automatically insert NULL values into these fields in the target.
6. Create table SQL: by default, the system automatically generates the SQL for creating a new table if it does not exist. You can override the default SQL by providing a template or JavaScript code. Read more about how to override the create table SQL.
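For reference, the statement executed to create a missing table is regular BigQuery DDL; conceptually, it resembles the sketch below, where the dataset, table, and column names are hypothetical and the SQL actually generated by the system may differ:

CREATE TABLE IF NOT EXISTS analytics.customers (
  customer_id INT64,
  first_name STRING,
  updated_at TIMESTAMP
)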
Configure notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure the Flow to send a notification when either source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
The only difference is that you don't need to configure the exception handler. If the system detects that the source and the destination have a different number of columns, it will add an exception to the log that can be retrieved and sent in the email body as explained above.
Step 10. Optionally configure Mapping
If necessary, you can create a Mapping between the source and destination (Google BigQuery) fields.
You create the Mapping between the source and destination just like you usually do for any other Flow type.
Configure change replication
When change replication is enabled, only the changed records will be loaded into BigQuery.
Basically, all you need to do is set the High Watermark Field and Enable Change Replication for the transformation. Additionally, for better reliability, you can set the calculated High Watermark Field Value.
Load multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations), follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully qualified source database object name with wildcard characters (* and ?), for example, public.*.
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude Objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include Objects field.
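For example, to skip two (hypothetical) housekeeping tables, enter public.schema_version,public.audit_log in the Exclude Objects field.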
TO
Set the TO to db.schema.*, for example, dbo.history.*.
Optionally, configure the Source query. You can use token {TABLE} in the Source query.
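For example, a Source query such as SELECT * FROM {TABLE} WHERE updated_at >= '2024-01-01' extracts only recent rows from each matching table; the updated_at column and the cutoff date are hypothetical, and the column must exist in every table selected by the wildcard.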
BEFORE/AFTER SQL
If you want to execute any code before and/or after LOAD, use the token {TABLE} in the Before LOAD SQL and After LOAD SQL.
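For example, an After LOAD SQL such as DELETE FROM {TABLE} WHERE customer_id IS NULL (the customer_id column is hypothetical) will be executed with {TABLE} replaced by the name of the table being loaded.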
MERGE
If you are configuring the MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
An alternative to enabling the Predict Lookup Fields option (which is not always accurate) is to specify the list of table=fields pairs in the Lookup Fields. Use fully qualified table names and ; as a separator between table=fields pairs.
If you are setting up the high watermark change replication with a calculated HWM field, use the token {TABLE} in the High Watermark Field Value.
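For example, a calculated value such as SELECT MAX(updated_at) FROM {TABLE} could be used, assuming the value is computed by a query and the (hypothetical) updated_at column exists in every table matched by the wildcard.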
Flow variables
To add user-defined Flow variables, open the Parameters tab and add variables as key-value pairs. Use uppercase characters and no spaces for variable names.
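For example (hypothetical names and values): TARGET_DATASET=analytics, LOAD_REGION=us-east1.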
Maximum number of parallel threads
It is possible to execute multiple source-to-destination transformations (extract-load) in parallel threads.
By default, Etlworks limits the number of threads to 5. To change the limit, open the Parameters tab and set the maximum number of threads in Maximum Number of Parallel Threads. This value cannot be greater than 99.