Extract, transform, and load data in Databricks – Etlworks Support

This flow extracts data from any source, optionally transforms it, and loads it into Databricks. Etlworks stages each batch as a file in cloud storage or a Databricks Volume, then loads the file into the destination Delta table using COPY INTO. INSERT, MERGE, and CDC MERGE actions are supported. The destination table is auto-created (as a Delta table) and auto-altered for schema drift.

Which Databricks flow should I use?

Flow	You are here?	Use when
Any to Databricks (Database / File / Queue / Web service / Well-known API)	Yes	Extract from any source, optionally transform, and load into Databricks.
Bulk load files into Databricks		Files already exist in an external stage. No transformation. Auto-generates COPY INTO; supports MERGE.
Stream CDC events into Databricks		Real-time replication from a CDC-enabled source database.
Stream messages from a queue into Databricks		Real-time ingestion from a streaming-capable message queue.

How does this flow work?

Etlworks extracts rows from the source (database query, file, API call, queue message, …).
Rows are written to a staging file in cloud storage or a Databricks Volume. CSV is the default; JSON, Parquet, and Avro are also supported.
Etlworks executes COPY INTO against the destination Databricks table. The Delta table is created automatically if it doesn't exist (with USING DELTA).
For MERGE, the staging file is first loaded into a temporary Delta table, then merged into the destination using one of three strategies (see Configure MERGE below).
On success, staging files are purged.

What do I need before I start?

A Databricks connection (see Get started with Databricks for setup).
A stage connection — S3, ADLS Gen2, GCS, server storage, or Databricks Volume.
The user / service principal must have READ FILES on the stage location's external location or volume, plus SELECT, MODIFY, and (for auto-create) CREATE TABLE on the target schema.

Step-by-step setup

Step 1. Create connections

You need at least two connections:

The source connection (database, file, API, queue, …).
The Databricks connection (PAT or OAuth Service Principal).
A stage connection if you're using cloud storage or a Volume separate from the Databricks workspace.

Step 2. Create a data exchange format

Create a CSV format (the default and most efficient for Databricks COPY INTO). JSON, Parquet, and Avro are also supported — pick based on the source shape and downstream needs.

Step 3. Create the flow

In Flows, click + and pick a flow type matching the source: Database to Databricks, File to Databricks, Web service to Databricks, Well-known API to Databricks, or Queue to Databricks.

Step 4. Set the Databricks connection on the flow

Set the named connection Databricks on the flow to the Databricks connection from step 1.

Step 5. Set the stage connection

If the destination cluster reads from cloud storage or a Volume, set the stage connection on the flow. For S3, ADLS Gen2, and GCS, Etlworks emits the stage path (s3://, abfss://, gs://) into the COPY INTO command. For Databricks Volumes, point a server-storage connection at /Volumes/<catalog>/<schema>/<volume>/<path>.

Step 6. Set source (FROM) and destination (TO)

In the transformation, set FROM to the source connection and object (table, file, endpoint, …). Set TO to the destination Databricks table. Use fully qualified catalog.schema.table if your workspace uses Unity Catalog.

Step 7. Set parameters

The Parameters tab includes Databricks-specific options:

Parameter	What it does
Action	COPY (default) or MERGE. See Configure MERGE.
Databricks Credential	Optional storage credential clause for COPY INTO. Either a named credential (CREDENTIAL my_cred) or inline (AWS_ACCESS_KEY = '…', AWS_SECRET_KEY = '…' or AZURE_SAS_TOKEN = '…'). Leave blank when the workspace has an External Location covering the stage path or when the stage is a managed Volume.
Override COPY INTO SQL	Provide the full COPY INTO statement yourself. Disables auto-generation.
Line Separator	Override the line ending sent to FORMAT_OPTIONS.
Purge File if Success	Delete the staging file after a successful load (default true).
Purge File(s) if Error	Delete the staging file on failure (default true).
Before COPY SQL / After COPY SQL	SQL run on the Databricks connection before / after each COPY INTO. Multiple ;-separated statements supported. Useful for OPTIMIZE, VACUUM, Z-ORDER BY, statistics refresh, or partition maintenance.

Step 8. Configure MERGE (UPSERT) — optional

Set Action to MERGE when you need UPSERT semantics. Enter the lookup fields in Lookup Fields (comma-separated) or enable Predict Lookup Fields to have Etlworks infer them from the destination's primary or unique key.

Pick a MERGE strategy in How to MERGE:

Strategy	How it works	When to use
DELETE / INSERT (default)	Load staging file into a temp Delta table. DELETE matching rows from the destination, then INSERT from the temp table.	Largest batches and where the destination table is large. Avoids the per-row scan cost of MERGE INTO.
Native MERGE	Generate a single MERGE INTO statement against the temp Delta table.	Small to mid-size batches; correctness-first scenarios.
Native MERGE with separate DELETE	Two-statement variant: a separate DELETE for rows flagged as deleted, then a MERGE INTO for the upsert.	CDC streams where soft-delete markers must be honored independently of UPSERTs.

For custom MERGE logic, use User-defined MERGE SQL with these tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}, {INSERT_FIELDS}, {UPDATE_FIELDS}, {UPDATE_CONDITIONS}.

Step 9. Handle source / destination schema differences

Two Databricks-specific options affect how the engine handles schema drift — both default to true and are unusual relative to other warehouses:

Create temp tables with only TEXT columns (default true). The staging Delta table is created with every column as STRING, then the engine casts during COPY INTO / INSERT INTO. This protects the load against minor source schema changes (extra column on day 2, type widening, …) and is the main schema-drift resilience knob.
Alter target table if source has new columns (default false). Enable to have Etlworks ALTER TABLE ADD COLUMNS when the source has columns the destination doesn't.

Also available: Recreate target table, Insert null into fields that don't exist in source.

Step 10. Configure mapping — optional

If source and destination columns don't match by name, configure per-field mapping.

Incremental change replication (high watermark)

To extract only changed rows on each run, set the High Watermark Field on the transformation and pick a watermark type (timestamp or monotonic sequence). For details, see Change replication using High Watermark.

For real-time replication from a transaction log, use Stream CDC events into Databricks instead of HWM.

Load multiple tables with a wildcard

One transformation can load many tables when the source is a database.

FROM

Enter a wildcard source name — for example public.* for all tables in the source schema public.

TO

Use the asterisk to map source names to destination names. Example: analytics.public.* maps source test.dbo.patient to destination analytics.public.patient.

Include / exclude objects

Use Include objects and Exclude objects to filter the wildcard match. See Source-to-destination transformation for the full syntax.

MERGE with wildcards

For per-table lookup fields, enter table=fields pairs separated by ;:

analytics.public.inventory = inventory_id, database_name;
analytics.public.payment   = payment_id, database_name;
analytics.public.rental    = rental_id, database_name;

Or enable Predict Lookup Fields to let the flow infer keys per table.

HWM with wildcards

HWM works alongside wildcard sources; the watermark value is tracked per table.

Troubleshooting

For common load issues, see Common issues when loading data into cloud data warehouses.

Articles in this section