Azure Synapse Analytics is built on top of Microsoft SQL Server, so you can use the same techniques you would normally use to work with relational databases. However, it is important to understand that inserting data into Synapse Analytics row by row can be painfully slow.
It is recommended that you use Flows optimized to ETL data into Synapse Analytics.
Use this Flow type when you need to bulk-load files that already exist in Azure Blob Storage without applying any transformations. The Flow automatically generates the COPY command and MERGEs data into the destination.
When configuring a Connection for Azure Storage, which will be used as a stage for the Flows, it is recommended that you select GZip as the value for the Archive file before copying field. GZip compression must be disabled if you are loading data in the Parquet format.
To configure the final destination, click the Connections tab, and select the available Synapse Analytics Connection.
Step 5. Set the source and the destination
Depending upon the Flow type, you can select one of the following sources (FROM) for the Synapse Analytics Flow:
Database: use the database table or view name as the source (FROM) name. The wildcard names, for example, public.*, are supported.
Well-known API: use the endpoint name or the file name as the source (FROM) name.
Web Service: use the endpoint name as the source (FROM) name.
File: use the source file name or a wildcard file name as the source (FROM) name.
Queue: use the queue name as the source (FROM) name.
For all Synapse Analytics Flows, the destination Connection is going to be an Azure Storage Connection. The Flow uses the destination Connection as a stage.
Select or enter the fully-qualified Synapse table name as a destination (TO). If the source (FROM) is entered using a wildcard template, set the destination (TO) to a wildcard as well.
Step 6. Set the data exchange Format configured in Step 2
Step 7. Set the optional parameters
Click MAPPING in the transformation row and select the Parameters tab.
Optional parameters and transformations:
Column List: you can specify a comma-separated list of column names to load source data fields into specific target columns. The columns can be in any order in the COPY INTO statement, but when loading from flat files, such as in an Azure Storage container, their order must match the order of the source data.
Line Separator: the default is \n on Windows and 0x0a on Linux. You can override it using this property. This is useful if CSV files are generated in one operating system but will be loaded in another.
Max # of Errors to Ignore: specifies the maximum number of reject rows allowed in the load before the COPY operation is canceled. Each row that cannot be imported by the COPY operation is ignored and counted as one error. If max_errors is not specified, the default is 0.
Error File Directory: specifies the directory relative to the container within the COPY statement where the rejected rows and the corresponding error file should be written. If the specified path doesn't exist, one will be created on your behalf. A child directory is created with the name rejectedrows.
Identity Insert: specifies whether the identity value or values in the imported data file are to be used for the identity column. If IDENTITY_INSERT is OFF (the default), the identity values for this column are verified but not imported; Azure Synapse Analytics will automatically assign unique values based on the seed and increment values specified during table creation. Note the following behavior with the COPY command: if IDENTITY_INSERT is OFF and the table has an identity column, a column list must be specified which does not map an input field to the identity column; if IDENTITY_INSERT is ON and the table has an identity column, a column list, if passed, must map an input field to the identity column. The default value is not supported for the identity column in the column list. IDENTITY_INSERT can only be set for one table at a time.
Date Format: possible values: mdy | dmy | ymd | ydm | myd | dym. Specifies the date format of the date mapping to SQL Server date formats.
Action: can be one of the following: COPY INTO: inserts records from the file(s) into the Synapse Analytics table; MERGE: merges records from the source with the records in the Synapse Analytics table; CDC MERGE: merges records from the CDC-enabled source with the records in the Synapse Analytics table. MERGE and CDC MERGE require configuring Lookup Fields.
How to MERGE: determines how to perform the MERGE into the Synapse Analytics table. The default is DELETE/INSERT: DELETE all records in the actual table that also exist in the temp table, then INSERT all records from the temp table into the actual table. If this parameter is set to Native MERGE, the Flow will execute native Synapse Analytics MERGE SQL. Note that native MERGE requires the table to be created with DISTRIBUTION = HASH. Both strategies are sketched below.
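For illustration only, the two strategies are roughly equivalent to the following SQL. The table dbo.orders, its temporary counterpart dbo.orders_temp, and the columns order_id, customer_id, and amount are hypothetical; the SQL actually generated by the Flow may differ.

-- DELETE/INSERT strategy (minimal sketch)
DELETE FROM dbo.orders
WHERE EXISTS (SELECT 1 FROM dbo.orders_temp t WHERE t.order_id = dbo.orders.order_id);
INSERT INTO dbo.orders (order_id, customer_id, amount)
SELECT order_id, customer_id, amount FROM dbo.orders_temp;

-- Native MERGE strategy (minimal sketch)
MERGE INTO dbo.orders AS target
USING dbo.orders_temp AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET target.customer_id = source.customer_id, target.amount = source.amount
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount) VALUES (source.order_id, source.customer_id, source.amount);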
Create tables with DISTRIBUTION = HASH: if this option is enabled, the actual table is created as CREATE TABLE (COLUMNS) WITH (DISTRIBUTION = HASH(PK)). A hash-distributed table has a distribution column that is the hash key. IMPORTANT: Native MERGE requires enabling this option.
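For reference, a hash-distributed table looks similar to the following (the table and column names are hypothetical):

CREATE TABLE dbo.orders
(
    order_id INT NOT NULL,
    customer_id INT,
    amount DECIMAL(18,2)
)
WITH (DISTRIBUTION = HASH(order_id));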
Lookup Fields: the comma-separated list of fields that uniquely identify the record in the Synapse Analytics table. This field is required if action is set to MERGE or CDC MERGE.
COPY INTO SQL: this is a user-defined COPY SQL. By default, the Flow creates the COPY SQL automatically, based on the input and output parameters. You can override it by using this field. Read more about the Synapse Analytics COPY INTO command.
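The sketch below shows how the optional parameters described above (column list, line separator, maximum number of errors, error file directory, identity insert) typically map to COPY INTO options. The storage account, container, and column names are hypothetical, and the statement generated by the Flow may differ:

COPY INTO dbo.orders (order_id, customer_id, amount)
FROM 'https://myaccount.blob.core.windows.net/staging/orders/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    COMPRESSION = 'GZIP',
    FIELDTERMINATOR = ',',
    ROWTERMINATOR = '0x0A',
    MAXERRORS = 10,
    ERRORFILE = 'rejected',
    IDENTITY_INSERT = 'OFF'
);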
MERGE SQL: this is a user-defined SQL that will be used instead of the default when action is set to MERGE. If nothing is entered in this field, the default MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
{TABLE}: the table to MERGE data into.
{TEMP_TABLE}: the table to merge data from.
{KEY_FIELDS}: the fields uniquely identifying the record in both tables.
{FIELDS}: the fields to INSERT/UPDATE in the table to MERGE data into.
CDC MERGE SQL: this is a user-defined SQL that will be used instead of the default when action is set to CDC MERGE. If nothing is entered in this field, the default CDC MERGE SQL will be used. The following parameters are automatically populated and can be referenced as {TOKEN} in the SQL:
{TABLE}: the table to MERGE data into.
{TEMP_TABLE}: the table to merge data from.
{KEY_FIELDS}: the fields uniquely identifying the record in both tables.
{FIELDS}: the fields to INSERT.
{UPDATE_FIELDS}: the fields and values to update, in the format field=value,field=value.
{UPDATE_CONDITIONS}: the WHERE clause to update existing records.
Purge File if Success: if this option is enabled (it is by default) and the user-defined SQL is not used, the Flow will run the COPY INTO command with the purge option enabled. With purge enabled, the staging file will be automatically deleted after the COPY INTO command is successfully executed.
Purge File(s) if Error: if this option is enabled (by default), the staging files will be automatically deleted if a COPY INTO command was executed with an error.
ELT
Before COPY SQL: this SQL will be executed on the Synapse Analytics Connection before running COPY SQL. You can configure Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing Before COPY SQL: if this option is enabled, and there is an error when Before COPY SQL is executed, the error will be ignored.
Before COPY SQL is a script: if this option is enabled, Before COPY SQL will be executed as a script.
After COPY SQL: this SQL will be executed on the Synapse Analytics Connection after running COPY SQL. You can configure the Flow to ignore errors when executing this SQL and to execute SQL as a script.
Ignore errors when executing After COPY SQL: if this option is enabled, and there is an error when After COPY SQL is executed, the error will be ignored.
After COPY SQL is a script: if this option is enabled, After COPY SQL will be executed as a script.
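For example (the statements and table name are hypothetical), Before COPY SQL could prepare the target table for a full reload and After COPY SQL could refresh statistics once the load completes:

-- Before COPY SQL
TRUNCATE TABLE dbo.orders;

-- After COPY SQL
UPDATE STATISTICS dbo.orders;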
Debugging
Log each executed SQL statement: enable this flag if you want to log each automatically generated and executed SQL statement, including before/after COPY SQL.
Parameters specific for different source types
Depending on the Flow type, other Flow parameters can be added, as explained below:
Ignore Transformations: if this parameter is enabled, the system will copy files directly into the Synapse Analytics stage before executing the COPY INTO command. This greatly improves the performance of the load but automatically disables any transformations.
If this parameter is enabled, the system will automatically delete loaded files from the source.
If this parameter is enabled and there is an error during the load, the system will automatically delete loaded files from the source.
Step 8. Configure MERGE (UPSERT)
To MERGE (upsert) existing data in the Synapse Analytics table with new data:
Set the Action to MERGE.
Define the Lookup Fields, the comma-separated list of fields that uniquely identify the record in the Synapse Analytics table.
Alternatively, you can enable Predict Lookup Fields, which will force the Flow to use various algorithms to automatically predict the fields that uniquely identify the record. Note that it is not always possible to correctly detect the unique fields.
When enabling the Predict Lookup Fields option (which is not always accurate) is not an option, you can specify the list of table=fields pairs in the Lookup Fields. Use fully-qualified table names and ; as a separator between table=fields pairs.
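For example, a hypothetical Lookup Fields value for two tables could look like this (fully-qualified table names, ; between pairs, and , between fields of the same table):

dbo.orders=order_id;dbo.order_items=order_id,line_number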
Step 9. Configure how to handle source and destination schema changes
It is quite typical that the source (for example, the table in the OLTP database) and the destination (Synapse Analytics table) have a different number of columns. It is also entirely possible that the order of columns in the source is different from the order of columns in the destination. In either of these cases, by default, the Flow will fail, since the Synapse Analytics COPY INTO command cannot load files that have more or fewer columns than the target table or whose columns are in a different order.
When setting the source-to-destination transformation, it is possible to configure it to automatically handle schema changes. The following configuration options are available:
Reorder CSV Columns During Extract: if this option is enabled (it is disabled by default), the Flow reorders columns in the data file to match the order of columns in the target Synapse Analytics table. This option is ignored if the target table does not exist.
Ignore fields in the source that do not exist in the target table: if this option is enabled (it is disabled by default) and the source has more fields than the target (Synapse Analytics) table, the system will automatically exclude extra fields. Options 2, 3, and 4 are mutually exclusive.
Alter target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target (Synapse Analytics) table, the system will add extra fields to the target table. Options 2, 3, and 4 are mutually exclusive.
Recreate target table if the source has columns that the target table doesn't have: if this option is enabled (it is disabled by default) and the source has different fields than the target (Synapse Analytics) table, the system will automatically drop and recreate the target table. Options 2, 3, and 4 are mutually exclusive.
Insert null into the fields in the target table that do not exist in the source: if this option is enabled (it is disabled by default) and the target (Synapse Analytics) table has different fields than the source, the system will automatically insert NULL values into these fields in the target.
Configure notifications if source and destination have different columns
It is possible to send an email notification if the source and destination have different columns. To configure the Flow to send a notification when either the source has more columns than the destination or the destination has more columns than the source, use the technique explained in this article.
The only difference is that you don't need to configure the exception handler. If the system detects that the source and the destination have a different number of columns, it will add an exception to the log, which can then be retrieved and sent in the email body as explained above.
Step 10. Optionally configure Mapping
If necessary, you can create a Mapping between the source and destination (Synapse Analytics) fields.
You create the Mapping between the source and destination just like you usually do for any other Flow type.
Configure change replication
When change replication is enabled, only the changed records will be loaded into Synapse Analytics.
Basically, all you need to do is set the High Watermark Field and Enable Change Replication for the transformation. Additionally, for better reliability, you can set the calculated High Watermark Field Value.
Load multiple tables by a wildcard name
To load multiple database objects (tables and views) by a wildcard name (without creating individual source-to-destination transformations), follow the same steps as above, except:
FROM
Create a single source-to-destination transformation and set the FROM to a fully-qualified source database object name with wildcard characters (* and ?), for example public.*
By default, all tables and all views that match the wildcard name in FROM will be included. If you want to exclude specific database objects, enter a comma-separated list of the database objects to exclude in the Exclude Objects field. If you want to include only specific database objects, enter a comma-separated list of the database objects to include in the Include Objects field.
TO
Set the TO to the db.schema.*, for example, dbo.history.*.
Optionally, configure the Source query. You can use token {TABLE} in the Source query.
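For example, a hypothetical Source query that uses the {TABLE} token to filter out soft-deleted rows (the is_deleted column is illustrative):

SELECT * FROM {TABLE} WHERE is_deleted = 0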
BEFORE/AFTER SQL
If you want to execute any code before and/or after COPY, use token {TABLE} in the Before COPY SQL and After COPY SQL.
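For example, a hypothetical Before COPY SQL that truncates each destination table before it is loaded:

TRUNCATE TABLE {TABLE};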
MERGE
If you are configuring MERGE action, do not enter anything in the Lookup Fields. Enable Predict Lookup Fields instead.
An alternative to enabling the Predict Lookup Fields option (which is not always accurate) is to specify the list of table=fields pairs in the Lookup Fields. Use fully-qualified table names and ; as a separator between table=fields pairs.
If you are setting up the high watermark change replication with a calculated HWM field, use token {TABLE} in the High Watermark Field Value.
Flow variables
To add user-defined Flow variables, open the Parameters tab and add variables as key-value pairs. Use uppercase characters and no spaces for variable names.
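For example, a hypothetical variable defined on the Parameters tab as a key-value pair:

TARGET_SCHEMA = dbo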
Maximum number of parallel threads
It is possible to execute multiple source-to-destination transformations (extract-load) in parallel threads.
By default, the number of threads is limited to 5. To change the limit, open the Parameters tab and set the maximum number of threads in Maximum Number of Parallel Threads. This value cannot be greater than 99.