Overview
Etlworks supports various ETL and streaming techniques when working with MongoDB.
Extract data from MongoDB
Extract data from a collection using CSV format
Data in MongoDB is stored in BSON, a binary representation of JSON. It supports nested documents as well as documents with a variable number of fields.
Etlworks can convert nested JSON documents into a flat format in which each nested field becomes a JSON-encoded string.
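For illustration, here is a minimal Python sketch of this kind of flattening (an illustration of the concept, not the Etlworks implementation):

```python
import json

def flatten(doc):
    """Flatten a nested document: any nested value (sub-document
    or array) is serialized to a JSON-encoded string."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            flat[key] = json.dumps(value)  # nested field -> JSON string
        else:
            flat[key] = value
    return flat

doc = {"_id": 1, "name": "Acme", "address": {"city": "Boston", "zip": "02101"}}
print(flatten(doc))
# {'_id': 1, 'name': 'Acme', 'address': '{"city": "Boston", "zip": "02101"}'}
```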
Use cases (CSV)
- Extract data from MongoDB and load it into a relational database.
- Extract data from MongoDB and load it into a data warehouse such as Snowflake, Amazon Redshift, or Azure Synapse Analytics.
Process (CSV)
Step 1. Create a MongoDB Connection.
Step 2. Create a CSV Format. The default settings for the CSV Format will work.
Step 3. Create a Flow where the source is a file, and the destination is a database.
Step 4. Select the MongoDB Connection created in Step 1 and the CSV Format created in Step 2 as the source (FROM).
Step 5. Enter or select the MongoDB collection name into the FROM field.
Note: This Flow supports extracting by wildcard, for example, customer*. When the FROM is configured with a wildcard, you can use a wildcard in the TO as well.
Step 6. Select a destination Connection and Format (if needed), and enter the destination into the TO field.
Step 7. Continue by adding Mapping and transformations. Read how to use SQL with MongoDB.
Extract data from a collection using JSON format
If you prefer to keep the nested structure when extracting the data from the MongoDB document, use JSON format.
Use cases (JSON)
- Extract data from MongoDB and create JSON files.
- Extract data from MongoDB and load it into a message queue such as Kafka or Kinesis.
Process (JSON)
Step 1. Create a MongoDB Connection.
Step 2. Create a JSON Format. The default settings for the JSON Format will work.
Step 3. Create a Flow where the source is a file, and the destination is anything.
Step 4. Select the MongoDB Connection created in Step 1 and the JSON Format created in Step 2 as the source (FROM).
Note: This Flow supports extracting by wildcard, for example, customer*. When the FROM is configured with a wildcard, you can use a wildcard in the TO as well.
Step 5. Enter or select the MongoDB collection name into the FROM field.
Step 6. Select a destination Connection and Format (if needed), and enter the destination into the TO field.
Step 7. Continue by adding Mapping and transformations. Read how to use SQL with MongoDB.
Extract data from a collection using a filter
Using filters is deprecated. Read how to use SQL with MongoDB.
When extracting data from a collection using either the CSV or JSON format, it is possible to set a filter. When a filter is set, the Flow filters and optionally transforms the documents retrieved from the collection.
Process
Follow the same steps as described above for the CSV and JSON formats.
When setting the FROM field (Step 5), enter the filter in the following format: collection.{filter}.
Example of a filter: {"STORE_ID":1} extracts all documents where STORE_ID = 1.
Read about the MongoDB query language.
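The filter uses the standard MongoDB query syntax. As an illustration of what such a filter selects, here is the equivalent query in Python with pymongo (connection string, database, and collection names are placeholders):

```python
from pymongo import MongoClient

# Placeholder connection string and database name
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Equivalent of configuring FROM as customers.{"STORE_ID":1}:
# select all documents in the customers collection where STORE_ID = 1
for doc in db["customers"].find({"STORE_ID": 1}):
    print(doc)
```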
Extract data from a specific document or a group of documents
Step 1. Create a MongoDB document Connection.
Step 2. Create a JSON Format. The default settings for the JSON Format will work.
Step 3. Create a Flow where the source is a file, and the destination is anything.
Step 4. Select the MongoDB document Connection created in Step 1 and the JSON Format created in Step 2 as the source (FROM).
Step 5. Enter or select the source into the FROM field.
Step 6. Select a destination Connection and Format (if needed), and enter the destination into the TO field.
Step 7. Click Mapping and enter SQL in Source query. Read how to use SQL with MongoDB.
Capture changes in MongoDB documents and collections
The Etlworks MongoDB CDC connector can monitor a MongoDB replica set or a MongoDB sharded cluster for document changes in databases and collections and send those changes as events to the destination.
Read how to set up a Flow to capture changes in MongoDB documents and collections.
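MongoDB CDC is typically built on change streams, which is why a replica set or sharded cluster is required. A minimal Python sketch of the underlying mechanism (an illustration, not the Etlworks connector itself; the connection string and names are placeholders):

```python
from pymongo import MongoClient

# Change streams require a replica set or sharded cluster,
# which is why the CDC connector has the same requirement.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
collection = client["mydb"]["customers"]

# Each change (insert, update, replace, delete) arrives as an event
# that a CDC pipeline would forward to the destination.
with collection.watch() as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])
```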
Load data in MongoDB
Load data in MongoDB and create a separate document for each source record
Use it if you want to store each record of the source dataset as a separate MongoDB document.
Step 1. Create a MongoDB Connection.
When configuring the Connection, set the parameter What to do with the existing document when writing to one of the following options (a sketch illustrating these semantics follows the list):
- replace: replace the existing document with all the fields.
- update: update only the fields in the existing document passed in the payload and leave other fields unchanged.
- insert: add new documents without checking for existing records. This is the fastest option because MongoDB does not need to perform any lookup operations, but it should only be used for initial data loads, where there is no risk of inserting duplicate documents.
Step 2. Create a JSON Format. If the source dataset to store in MongoDB is an ARRAY (for example, a database table, CSV file, or JSON array), set the First Object Type to variable and the Root Object Name to the name you want to use for the root element. This field is required.
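As an illustration of the Root Object Name setting, if the root name is set to data (a placeholder name), the serialized payload wraps the source array under that root element, roughly like this sketch:

```python
import json

rows = [{"CUSTOMER_ID": 1}, {"CUSTOMER_ID": 2}]  # source ARRAY (e.g., a table)

# The Root Object Name ("data" here) becomes the root element that
# wraps the source array in the generated JSON.
payload = {"data": rows}
print(json.dumps(payload))
# {"data": [{"CUSTOMER_ID": 1}, {"CUSTOMER_ID": 2}]}
```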
Step 3. Create a Flow where the destination is a file, and the source is anything.
Step 4. Select the source (FROM) Connection and Format. Enter the source object name.
Step 5. Select the MongoDB Connection created in Step 1 and the JSON Format created in Step 2 as the destination (TO).
Step 6. Enter or select the MongoDB collection name into the TO field.
Optionally, you can specify the field that will be used to generate the unique value of the MongoDB object id (the _id field).
Format: collection_name.unique_field_name
Example: customers.CUSTOMER_ID
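Conceptually, the customers.CUSTOMER_ID setting means the value of CUSTOMER_ID becomes the document's _id, so repeated loads of the same record target the same document instead of creating a duplicate. A hedged pymongo sketch of that idea (not Etlworks internals; names are placeholders):

```python
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["customers"]
record = {"CUSTOMER_ID": 15, "NAME": "Acme"}

# Use the unique source field as the MongoDB object id (_id) so that
# reloading the same record updates the same document.
doc = dict(record)
doc["_id"] = doc["CUSTOMER_ID"]
collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```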
Step 7. Continue by adding Mapping and transformations.
Load data in MongoDB as a single document
Use it if you want to store the entire dataset (for example, the entire Excel worksheet) as a single MongoDB document.
Step 1. Create a MongoDB document Connection.
Step 2. Create a JSON Format. If the source dataset to store in MongoDB is an ARRAY (for example, a database table, CSV file, or JSON array), set the First Object Type to variable and the Root Object Name to the name you want to use for the root element. This field is required.
Step 3. Create a Flow where the destination is a file, and the source is anything.
Step 4. Select the source (FROM) Connection and Format. Enter the source object name.
Step 5. Select the MongoDB document Connection created in Step 1 and the JSON Format created in Step 2 as the destination (TO).
Step 6. Enter or select the MongoDB object id into the TO field.
Step 7. Continue by adding Mapping and transformations.
Improving Performance When Loading Documents into MongoDB
Several settings can be adjusted to make loading data into MongoDB more efficient, particularly when handling large datasets. Key parameters such as Batch Size, Write Concern, and the action defined in What to do with the existing document when writing can significantly impact the performance of your data load operations.
Key Parameters for Performance Optimization:
Batch Size: Adjusting the batch size controls how many documents are written in a single operation. A larger batch size can improve performance by reducing the number of network roundtrips between the ETL process and MongoDB, which is particularly beneficial for large datasets because it minimizes the overhead of many small write operations.
Write Concern: This parameter defines the level of acknowledgment required from MongoDB for write operations. For instance, you can configure the write concern to require only minimal acknowledgment for improved speed, particularly during initial data loads, when stricter guarantees on data durability are less critical.
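A minimal pymongo sketch of both ideas, assuming a placeholder connection and collection (batching via insert_many and a relaxed write concern of w=0; this illustrates the MongoDB-side behavior, not Etlworks configuration):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")

# w=0 ("fire and forget") trades durability guarantees for speed;
# reasonable for an initial load, risky for incremental updates.
fast = client["mydb"].get_collection(
    "customers", write_concern=WriteConcern(w=0)
)

rows = [{"CUSTOMER_ID": i} for i in range(10_000)]

# One insert_many call per batch: far fewer network roundtrips
# than inserting documents one at a time.
batch_size = 1_000
for start in range(0, len(rows), batch_size):
    fast.insert_many(rows[start:start + batch_size], ordered=False)
```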
“What to Do with the Existing Document When Writing” Options
The What to do with the existing document when writing setting provides three options for managing existing documents in MongoDB, each with different performance characteristics:
- Insert: This option adds new documents without checking for existing records. It offers the fastest performance because MongoDB does not need to perform any lookup operations. However, it should only be used for initial data loads, where there is no risk of inserting duplicate documents.
- Update: This option checks for existing documents and updates them if they are found. While slower than Insert, it ensures that records are updated correctly and avoids creating duplicates.
- Replace: This option replaces the entire document if an existing one is found. It is the slowest option of the three but ensures that no old data remains.
Using “Insert” for Optimal Performance
For initial data loads, selecting Insert is the most efficient option, as it bypasses checks for existing records. This approach significantly speeds up the process because MongoDB can directly insert the data without any additional logic for matching or replacing documents. However, after the initial load, switching to Update or Replace ensures that future data loads handle duplicates and updates properly.
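A hedged sketch of this pattern with pymongo (placeholder names): plain inserts for the first load, then bulk upserts for subsequent loads so duplicates and updates are handled correctly:

```python
from pymongo import MongoClient, ReplaceOne

collection = MongoClient("mongodb://localhost:27017")["mydb"]["customers"]
rows = [{"_id": i, "NAME": f"Customer {i}"} for i in range(1000)]

# Initial load: plain inserts, no existence checks (fastest path).
collection.insert_many(rows, ordered=False)

# Subsequent loads: upserts so changed records replace the old ones
# and re-delivered records do not create duplicates.
ops = [ReplaceOne({"_id": r["_id"]}, r, upsert=True) for r in rows]
collection.bulk_write(ops, ordered=False)
```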
Conclusion
By adjusting Batch Size, configuring Write Concern, and using Insert for the initial load, you can drastically improve the performance of data loading operations into MongoDB.