Databricks is a unified data and AI platform built on the Delta Lake / Lakehouse architecture. Etlworks ships several flow types optimized for high-performance loading of data into Databricks and reading of data from it.
Which Databricks flow should I use?
| Flow | Use when |
|---|---|
| Any to Databricks (Database / File / Queue / Web service / Well-known API) | You need to extract from any source, optionally transform, and load into Databricks. |
| Bulk load files into Databricks | Files already exist in S3, ADLS Gen2, GCS, a Databricks Volume, or server storage. No transformation needed. Auto-generates COPY INTO; supports MERGE. |
| Stream CDC events into Databricks | You need real-time replication from a CDC-enabled source database (MySQL, SQL Server, PostgreSQL, Oracle, DB2, MongoDB, AS400). |
| Stream messages from a queue into Databricks | You need real-time ingestion from a message queue that supports streaming (Kafka, Kinesis, Pub/Sub, Service Bus, RabbitMQ, ActiveMQ, SQS). |
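To make the bulk-load row more concrete, here is a minimal sketch of the general shape of a Databricks COPY INTO statement. It is not the exact SQL that Etlworks generates, which this article does not show. The helper names (`quote_ident`, `copy_into_sql`) and the sample catalog, schema, table, and bucket are hypothetical.

```python
def quote_ident(name: str) -> str:
    """Backtick-quote one identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def copy_into_sql(catalog, schema, table, source_uri,
                  file_format="CSV", format_options=None):
    """Build a COPY INTO statement for a fully qualified target table.

    Illustrative only: option values are inserted as-is, without escaping.
    """
    if "'" in source_uri:
        raise ValueError("source_uri must not contain single quotes")
    target = ".".join(quote_ident(p) for p in (catalog, schema, table))
    lines = [
        f"COPY INTO {target}",
        f"FROM '{source_uri}'",
        f"FILEFORMAT = {file_format}",
    ]
    if format_options:
        opts = ", ".join(f"'{k}' = '{v}'" for k, v in format_options.items())
        lines.append(f"FORMAT_OPTIONS ({opts})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical target table and stage location.
    print(copy_into_sql("main", "sales", "orders",
                        "s3://example-bucket/stage/orders/",
                        "CSV", {"header": "true"}))
```

In the flow itself you do not write this statement by hand; the sketch only shows what "auto-generates COPY INTO" refers to.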
What do I need before I start?
- A Databricks workspace with a running compute resource (SQL warehouse, all-purpose cluster, or serverless SQL).
- The compute resource's Server hostname and HTTP path — both visible on the compute resource's Connection details tab in the Databricks UI.
- Authentication credentials — either a personal access token (PAT) or an OAuth service principal with client ID and OAuth secret.
- Permissions: USE CATALOG and USE SCHEMA on the target Unity Catalog catalog and schema, plus SELECT and MODIFY on the target schema. Flows that auto-create destination tables also need CREATE TABLE on that schema.
- A stage for bulk operations: an S3 bucket, ADLS Gen2 container, GCS bucket, or a Databricks Volume that the workspace can read. For Unity Catalog, the stage typically lives behind an External Location with a storage credential.
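The permissions listed above could be granted with Unity Catalog GRANT statements along these lines. This is a sketch, not an official setup script: the function name `grant_statements`, the principal, and the catalog and schema names are hypothetical, and it assumes that CREATE TABLE is only required when a flow auto-creates destination tables.

```python
def grant_statements(principal, catalog, schema, auto_create_tables=True):
    """Return GRANT statements for a user or service principal.

    Assumption: CREATE TABLE is included only when flows auto-create tables.
    """
    grantee = f"`{principal}`"
    fq_schema = f"`{catalog}`.`{schema}`"
    privileges = ["SELECT", "MODIFY"]
    if auto_create_tables:
        privileges.append("CREATE TABLE")
    return [
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO {grantee}",
        f"GRANT USE SCHEMA ON SCHEMA {fq_schema} TO {grantee}",
        f"GRANT {', '.join(privileges)} ON SCHEMA {fq_schema} TO {grantee}",
    ]


if __name__ == "__main__":
    # Hypothetical principal and target schema.
    for stmt in grant_statements("etl-service-principal", "main", "sales"):
        print(stmt + ";")
```

A workspace admin or the owner of the catalog and schema would typically run the resulting statements in a SQL editor.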
Connect to Databricks
- Open the Connections window and click +.
- Type databricks in the search field.
- Select the Databricks connection.
- Pick the authentication method:
  - Personal Access Token (default): enter the host, HTTP path, and the token.
  - OAuth Service Principal: enter the host, HTTP path, the service principal's client ID, and the OAuth secret. Recommended for production.
- Optionally set a default Schema. To target a different catalog / schema per flow, use a fully qualified catalog.schema.table name in the destination.
- For the full connection-parameter reference, see configuring the Databricks connection.
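For orientation, the sketch below shows how the host, HTTP path, and credentials from the steps above might combine into a Databricks JDBC URL for each authentication method. It is an illustration under assumptions: the article does not state how Etlworks builds its connection string, the property names (`AuthMech`, `Auth_Flow`, `OAuth2ClientId`, `OAuth2Secret`) follow the Databricks JDBC driver documentation and should be checked against your driver version, and the function name and sample values are hypothetical.

```python
def jdbc_url(host, http_path, *, token=None, client_id=None, client_secret=None):
    """Build a Databricks JDBC URL for PAT or OAuth service-principal auth."""
    # Accept a pasted workspace URL as well as a bare hostname.
    if host.startswith("https://"):
        host = host[len("https://"):]
    host = host.rstrip("/")
    base = f"jdbc:databricks://{host}:443;httpPath={http_path}"
    if token:
        # Personal access token: the user name is the literal string "token".
        return f"{base};AuthMech=3;UID=token;PWD={token}"
    if client_id and client_secret:
        # OAuth machine-to-machine flow for a service principal.
        return (f"{base};AuthMech=11;Auth_Flow=1;"
                f"OAuth2ClientId={client_id};OAuth2Secret={client_secret}")
    raise ValueError("provide either token or client_id and client_secret")


if __name__ == "__main__":
    # Hypothetical workspace host, warehouse path, and token.
    print(jdbc_url("example.cloud.databricks.com",
                   "/sql/1.0/warehouses/abc123",
                   token="dapiEXAMPLE"))
```

In the Etlworks connection form you enter these values in separate fields; the URL form is mainly useful when testing the same credentials from another JDBC client.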
Also create a connection for the stage. The supported stage types are:
- Amazon S3 — for AWS workspaces.
- Azure Storage (ADLS Gen2) — for Azure workspaces.
- Google Cloud Storage — for GCP workspaces.
- Server storage — for files already on the Etlworks instance.
- Databricks Volume — configure a server-storage connection pointed at the volume path (/Volumes/catalog/schema/volume/…).
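Each stage type above corresponds to a different location format on the Databricks side. The helpers below sketch the usual URI shapes for S3, ADLS Gen2, GCS, and Unity Catalog volumes. The function names and all sample bucket, container, account, and volume names are hypothetical, and the sketch does not check that the location actually exists or is readable by the workspace.

```python
def s3_uri(bucket, prefix=""):
    """Amazon S3 location, for example s3://bucket/folder/."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def adls_uri(container, account, prefix=""):
    """ADLS Gen2 location using the abfss scheme."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{prefix.lstrip('/')}"


def gcs_uri(bucket, prefix=""):
    """Google Cloud Storage location."""
    return f"gs://{bucket}/{prefix.lstrip('/')}"


def volume_path(catalog, schema, volume, prefix=""):
    """Unity Catalog volume path under /Volumes."""
    return f"/Volumes/{catalog}/{schema}/{volume}/{prefix.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical stage locations for each workspace cloud.
    print(s3_uri("example-bucket", "stage/orders/"))
    print(adls_uri("stage", "exampleaccount", "orders/"))
    print(gcs_uri("example-bucket", "stage/orders/"))
    print(volume_path("main", "sales", "landing", "orders/"))
```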
Where to go next
| Topic | Article |
|---|---|
| Extract, transform, and load data into Databricks | Extract, transform, and load data in Databricks |
| Bulk-load existing files | Bulk load files into Databricks |
| ELT — run transformation SQL directly in Databricks | ELT with Databricks |
| Reverse ETL — extract from Databricks into any destination | Reverse ETL with Databricks |
| Data type mapping (JDBC ↔ Databricks) | Data type mapping for Databricks |
| Stream CDC events into Databricks | Create pipeline to CDC data into Databricks |
| Stream from message queues | Streaming with message queues |