Bulk load files into Databricks – Etlworks Support

Bulk load files that already exist in cloud storage, server storage, or a Databricks Volume into a Databricks Delta table. The flow auto-generates COPY INTO and supports INSERT, MERGE, and CDC MERGE actions.

When to use this flow

Use this flow when the data is already in files (typically CSV, JSON, Parquet, or Avro) and no transformation is needed before loading into Databricks. Common scenarios:

Files dropped daily by a vendor into S3 / ADLS / GCS.
Files produced by another flow (Etlworks or external) into a Databricks Volume.
A periodic catch-up after a streaming CDC pause.

If you need transformation, use Any to Databricks instead.

How does this flow work?

Etlworks iterates over the files in the stage matching the configured pattern.
For each file, Etlworks issues COPY INTO against the destination Delta table (auto-created with USING DELTA if missing).
For MERGE, the file is loaded into a temp Delta table first, then merged into the destination using the configured strategy.
On success, processed files can be deleted, archived, or skipped on the next run.

Features

File formats: CSV, JSON, Parquet, Avro.
Stage types: S3, ADLS Gen2, GCS, server storage, Databricks Volumes (managed and external).
Auto-generated COPY INTO with FORMAT_OPTIONS for delimiter, quote, header, line separator, date format, timestamp format.
MERGE strategies: DELETE / INSERT, Native MERGE, Native MERGE with separate DELETE.
Wildcard / regex file selection. Recursive subfolder scanning.
Parallel loads across multiple destination tables.
Skip-already-processed-files cache.
Schema drift resilience via TEXT-only temp tables (see Schema drift).

What do I need before I start?

A Databricks connection. See Get started with Databricks for setup.
A stage connection (S3, ADLS Gen2, GCS, server storage, or Volume) and read permission for the user / service principal on that location.
Files already present in the stage, or arriving on a schedule.

Step-by-step setup

Step 1. Create the stage connection

Pick one based on where your files live:

Amazon S3 — for AWS-hosted workspaces.
Azure Storage (ADLS Gen2) — for Azure workspaces.
Google Cloud Storage — for GCP workspaces.
Server storage — pointed at a Databricks Volume path (/Volumes/<catalog>/<schema>/<volume>/…) or a directory on the Etlworks host.

Step 2. Create the Databricks connection

See Get started with Databricks for connection setup. PAT for development, OAuth Service Principal for production.

Step 3. Create the format

Match the file format on disk: CSV, JSON, Parquet, or Avro.

Step 4. Create the flow

In Flows, click + and pick Bulk load files into Databricks.

Step 5. Configure the load transformation

Add a transformation per destination table. Set FROM to the stage connection, format, and the file name or wildcard. Set TO to the Databricks connection and destination table (use catalog.schema.table for Unity Catalog).

Step 6. Set parameters

Databricks-specific load parameters:

Parameter	What it does
Action	INSERT (default), MERGE, or CDC MERGE.
How to MERGE	For MERGE / CDC MERGE: DELETE / INSERT (default), Native MERGE, or Native MERGE with separate DELETE.
Lookup Fields	Comma-separated key columns for MERGE.
Predict Lookup Fields	Let Etlworks infer the keys from the destination's primary or unique key.
Databricks Credential	Optional inline storage credential clause for COPY INTO. Leave blank when an Unity Catalog External Location or managed Volume covers the stage path.
Override COPY INTO SQL	Provide the full COPY INTO statement yourself.
Include files in subfolders	Recurse into subdirectories of the stage path (default false).
Exclude Files / Include Files	Comma-separated glob patterns for filename filtering.
Skip Previously Processed Files	Maintain a cache of already-loaded filenames; skip on the next run (default false).
Maximum Number of Files to Process	Cap the per-run batch size.
Maximum Number of Parallel Loads	Concurrent load threads across destination tables (default 10).
Continue loading data into other tables if Error	Don't abort the whole flow when one destination fails (default false).
Purge File if Success / Purge File(s) if Error	Delete processed files (default true).
Before COPY SQL / After COPY SQL	SQL run on the Databricks connection before / after each COPY INTO (good for OPTIMIZE, VACUUM, Z-ORDER BY, partition prep).

Step 7. Optionally configure mapping

Define per-field mapping if source columns don't match destination column names.

Step 8. Optionally add more transformations

One bulk-load flow can target many destination tables — add one transformation per table.

Schema drift

Two flags control how schema differences are handled:

Create temp tables with only TEXT columns (default true) — the staging Delta table is created with every column as STRING, then the engine casts during COPY INTO. This absorbs minor schema changes between batches without failing the load.
Alter target table if source has new columns — enable to ALTER TABLE ADD COLUMNS when the source widens.

Handling processed files

Delete processed files

Purge File if Success is on by default. To keep the files after load, turn it off.

Skip already-processed files

Enable Skip Previously Processed Files. Etlworks maintains a cache of processed filenames; on the next run, matching files are ignored. The cache TTL and storage location are configurable.

Move processed files

Set the Move processed files to named connection on the flow to archive loaded files to a different bucket / path.

Troubleshooting

For common load issues (credential errors, missing external locations, type-cast failures), see Common issues when loading data into cloud data warehouses.

Articles in this section