Bulk load files that already exist in cloud storage, server storage, or a Databricks Volume into a Databricks Delta table. The flow auto-generates COPY INTO and supports INSERT, MERGE, and CDC MERGE actions.
When to use this flow
Use this flow when the data is already in files (typically CSV, JSON, Parquet, or Avro) and no transformation is needed before loading into Databricks. Common scenarios:
- Files dropped daily by a vendor into S3 / ADLS / GCS.
- Files produced by another flow (Etlworks or external) into a Databricks Volume.
- A periodic catch-up after a streaming CDC pause.
If you need transformation, use Any to Databricks instead.
How does this flow work?
- Etlworks iterates over the files in the stage matching the configured pattern.
- For each file, Etlworks issues COPY INTO against the destination Delta table (auto-created with USING DELTA if missing).
- For MERGE, the file is loaded into a temp Delta table first, then merged into the destination using the configured strategy.
- On success, processed files can be deleted, archived, or skipped on the next run.
Features
- File formats: CSV, JSON, Parquet, Avro.
- Stage types: S3, ADLS Gen2, GCS, server storage, Databricks Volumes (managed and external).
- Auto-generated COPY INTO with FORMAT_OPTIONS for delimiter, quote, header, line separator, date format, timestamp format.
- MERGE strategies: DELETE / INSERT, Native MERGE, Native MERGE with separate DELETE.
- Wildcard / regex file selection. Recursive subfolder scanning.
- Parallel loads across multiple destination tables.
- Skip-already-processed-files cache.
- Schema drift resilience via TEXT-only temp tables (see Schema drift).
What do I need before I start?
- A Databricks connection. See Get started with Databricks for setup.
- A stage connection (S3, ADLS Gen2, GCS, server storage, or Volume) and read permission for the user / service principal on that location.
- Files already present in the stage, or arriving on a schedule.
Step-by-step setup
Step 1. Create the stage connection
Pick one based on where your files live:
- Amazon S3 — for AWS-hosted workspaces.
- Azure Storage (ADLS Gen2) — for Azure workspaces.
- Google Cloud Storage — for GCP workspaces.
- Server storage — pointed at a Databricks Volume path (/Volumes/<catalog>/<schema>/<volume>/…) or a directory on the Etlworks host.
Step 2. Create the Databricks connection
See Get started with Databricks for connection setup. PAT for development, OAuth Service Principal for production.
Step 3. Create the format
Match the file format on disk: CSV, JSON, Parquet, or Avro.
Step 4. Create the flow
In Flows, click + and pick Bulk load files into Databricks.
Step 5. Configure the load transformation
Add a transformation per destination table. Set FROM to the stage connection, format, and the file name or wildcard. Set TO to the Databricks connection and destination table (use catalog.schema.table for Unity Catalog).
Step 6. Set parameters
Databricks-specific load parameters:
| Parameter | What it does |
|---|---|
| Action | INSERT (default), MERGE, or CDC MERGE. |
| How to MERGE | For MERGE / CDC MERGE: DELETE / INSERT (default), Native MERGE, or Native MERGE with separate DELETE. |
| Lookup Fields | Comma-separated key columns for MERGE. |
| Predict Lookup Fields | Let Etlworks infer the keys from the destination's primary or unique key. |
| Databricks Credential | Optional inline storage credential clause for COPY INTO. Leave blank when an Unity Catalog External Location or managed Volume covers the stage path. |
| Override COPY INTO SQL | Provide the full COPY INTO statement yourself. |
| Include files in subfolders | Recurse into subdirectories of the stage path (default false). |
| Exclude Files / Include Files | Comma-separated glob patterns for filename filtering. |
| Skip Previously Processed Files | Maintain a cache of already-loaded filenames; skip on the next run (default false). |
| Maximum Number of Files to Process | Cap the per-run batch size. |
| Maximum Number of Parallel Loads | Concurrent load threads across destination tables (default 10). |
| Continue loading data into other tables if Error | Don't abort the whole flow when one destination fails (default false). |
| Purge File if Success / Purge File(s) if Error | Delete processed files (default true). |
| Before COPY SQL / After COPY SQL | SQL run on the Databricks connection before / after each COPY INTO (good for OPTIMIZE, VACUUM, Z-ORDER BY, partition prep). |
Step 7. Optionally configure mapping
Define per-field mapping if source columns don't match destination column names.
Step 8. Optionally add more transformations
One bulk-load flow can target many destination tables — add one transformation per table.
Schema drift
Two flags control how schema differences are handled:
- Create temp tables with only TEXT columns (default true) — the staging Delta table is created with every column as STRING, then the engine casts during COPY INTO. This absorbs minor schema changes between batches without failing the load.
- Alter target table if source has new columns — enable to ALTER TABLE ADD COLUMNS when the source widens.
Handling processed files
Delete processed files
Purge File if Success is on by default. To keep the files after load, turn it off.
Skip already-processed files
Enable Skip Previously Processed Files. Etlworks maintains a cache of processed filenames; on the next run, matching files are ignored. The cache TTL and storage location are configurable.
Move processed files
Set the Move processed files to named connection on the flow to archive loaded files to a different bucket / path.
Troubleshooting
For common load issues (credential errors, missing external locations, type-cast failures), see Common issues when loading data into cloud data warehouses.