This flow extracts data from any source, optionally transforms it, and loads it into Databricks. Etlworks stages each batch as a file in cloud storage or a Databricks Volume, then loads the file into the destination Delta table using COPY INTO. INSERT, MERGE, and CDC MERGE actions are supported. The destination table is auto-created (as a Delta table) and auto-altered for schema drift.
Which Databricks flow should I use?
| Flow | You are here? | Use when |
|---|---|---|
| Any to Databricks (Database / File / Queue / Web service / Well-known API) | Yes | Extract from any source, optionally transform, and load into Databricks. |
| Bulk load files into Databricks | Files already exist in an external stage. No transformation. Auto-generates COPY INTO; supports MERGE. | |
| Stream CDC events into Databricks | Real-time replication from a CDC-enabled source database. | |
| Stream messages from a queue into Databricks | Real-time ingestion from a streaming-capable message queue. |
How does this flow work?
- Etlworks extracts rows from the source (database query, file, API call, queue message, …).
- Rows are written to a staging file in cloud storage or a Databricks Volume. CSV is the default; JSON, Parquet, and Avro are also supported.
- Etlworks executes COPY INTO against the destination Databricks table. The Delta table is created automatically if it doesn't exist (with USING DELTA).
- For MERGE, the staging file is first loaded into a temporary Delta table, then merged into the destination using one of three strategies (see Configure MERGE below).
- On success, staging files are purged.
What do I need before I start?
- A Databricks connection (see Get started with Databricks for setup).
- A stage connection — S3, ADLS Gen2, GCS, server storage, or Databricks Volume.
- The user / service principal must have READ FILES on the stage location's external location or volume, plus SELECT, MODIFY, and (for auto-create) CREATE TABLE on the target schema.
Step-by-step setup
Step 1. Create connections
You need at least two connections:
- The source connection (database, file, API, queue, …).
- The Databricks connection (PAT or OAuth Service Principal).
- A stage connection if you're using cloud storage or a Volume separate from the Databricks workspace.
Step 2. Create a data exchange format
Create a CSV format (the default and most efficient for Databricks COPY INTO). JSON, Parquet, and Avro are also supported — pick based on the source shape and downstream needs.
Step 3. Create the flow
In Flows, click + and pick a flow type matching the source: Database to Databricks, File to Databricks, Web service to Databricks, Well-known API to Databricks, or Queue to Databricks.
Step 4. Set the Databricks connection on the flow
Set the named connection Databricks on the flow to the Databricks connection from step 1.
Step 5. Set the stage connection
If the destination cluster reads from cloud storage or a Volume, set the stage connection on the flow. For S3, ADLS Gen2, and GCS, Etlworks emits the stage path (s3://, abfss://, gs://) into the COPY INTO command. For Databricks Volumes, point a server-storage connection at /Volumes/<catalog>/<schema>/<volume>/<path>.
Step 6. Set source (FROM) and destination (TO)
In the transformation, set FROM to the source connection and object (table, file, endpoint, …). Set TO to the destination Databricks table. Use fully qualified catalog.schema.table if your workspace uses Unity Catalog.
Step 7. Set parameters
The Parameters tab includes Databricks-specific options:
| Parameter | What it does |
|---|---|
| Action | COPY (default) or MERGE. See Configure MERGE. |
| Databricks Credential | Optional storage credential clause for COPY INTO. Either a named credential (CREDENTIAL my_cred) or inline (AWS_ACCESS_KEY = '…', AWS_SECRET_KEY = '…' or AZURE_SAS_TOKEN = '…'). Leave blank when the workspace has an External Location covering the stage path or when the stage is a managed Volume. |
| Override COPY INTO SQL | Provide the full COPY INTO statement yourself. Disables auto-generation. |
| Line Separator | Override the line ending sent to FORMAT_OPTIONS. |
| Purge File if Success | Delete the staging file after a successful load (default true). |
| Purge File(s) if Error | Delete the staging file on failure (default true). |
| Before COPY SQL / After COPY SQL | SQL run on the Databricks connection before / after each COPY INTO. Multiple ;-separated statements supported. Useful for OPTIMIZE, VACUUM, Z-ORDER BY, statistics refresh, or partition maintenance. |
Step 8. Configure MERGE (UPSERT) — optional
Set Action to MERGE when you need UPSERT semantics. Enter the lookup fields in Lookup Fields (comma-separated) or enable Predict Lookup Fields to have Etlworks infer them from the destination's primary or unique key.
Pick a MERGE strategy in How to MERGE:
| Strategy | How it works | When to use |
|---|---|---|
| DELETE / INSERT (default) | Load staging file into a temp Delta table. DELETE matching rows from the destination, then INSERT from the temp table. | Largest batches and where the destination table is large. Avoids the per-row scan cost of MERGE INTO. |
| Native MERGE | Generate a single MERGE INTO statement against the temp Delta table. | Small to mid-size batches; correctness-first scenarios. |
| Native MERGE with separate DELETE | Two-statement variant: a separate DELETE for rows flagged as deleted, then a MERGE INTO for the upsert. | CDC streams where soft-delete markers must be honored independently of UPSERTs. |
For custom MERGE logic, use User-defined MERGE SQL with these tokens: {TABLE}, {TEMP_TABLE}, {KEY_FIELDS}, {FIELDS}, {INSERT_FIELDS}, {UPDATE_FIELDS}, {UPDATE_CONDITIONS}.
Step 9. Handle source / destination schema differences
Two Databricks-specific options affect how the engine handles schema drift — both default to true and are unusual relative to other warehouses:
- Create temp tables with only TEXT columns (default true). The staging Delta table is created with every column as STRING, then the engine casts during COPY INTO / INSERT INTO. This protects the load against minor source schema changes (extra column on day 2, type widening, …) and is the main schema-drift resilience knob.
- Alter target table if source has new columns (default false). Enable to have Etlworks ALTER TABLE ADD COLUMNS when the source has columns the destination doesn't.
Also available: Recreate target table, Insert null into fields that don't exist in source.
Step 10. Configure mapping — optional
If source and destination columns don't match by name, configure per-field mapping.
Incremental change replication (high watermark)
To extract only changed rows on each run, set the High Watermark Field on the transformation and pick a watermark type (timestamp or monotonic sequence). For details, see Change replication using High Watermark.
For real-time replication from a transaction log, use Stream CDC events into Databricks instead of HWM.
Load multiple tables with a wildcard
One transformation can load many tables when the source is a database.
FROM
Enter a wildcard source name — for example public.* for all tables in the source schema public.
TO
Use the asterisk to map source names to destination names. Example: analytics.public.* maps source test.dbo.patient to destination analytics.public.patient.
Include / exclude objects
Use Include objects and Exclude objects to filter the wildcard match. See Source-to-destination transformation for the full syntax.
MERGE with wildcards
For per-table lookup fields, enter table=fields pairs separated by ;:
analytics.public.inventory = inventory_id, database_name;
analytics.public.payment = payment_id, database_name;
analytics.public.rental = rental_id, database_name;Or enable Predict Lookup Fields to let the flow infer keys per table.
HWM with wildcards
HWM works alongside wildcard sources; the watermark value is tracked per table.
Troubleshooting
For common load issues, see Common issues when loading data into cloud data warehouses.