Overview
Etlworks supports various ETL and streaming techniques when working with MongoDB.
Extract data from MongoDB
Extract data from a collection using CSV format
Data in MongoDB is stored in BSON, a binary representation of JSON. It supports nested documents as well as documents with a variable number of fields.
Etlworks can convert nested JSON documents into a flat format in which each nested field becomes a JSON-encoded string.
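For illustration, here is a minimal Python sketch of this kind of flattening (an illustration of the concept, not the Etlworks implementation):

```python
import json

def flatten(doc):
    """Flatten a nested document: any nested value (sub-document
    or array) is serialized to a JSON-encoded string."""
    flat = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            flat[key] = json.dumps(value)  # nested field -> JSON string
        else:
            flat[key] = value
    return flat

doc = {"_id": 1, "name": "Acme", "address": {"city": "Boston", "zip": "02101"}}
print(flatten(doc))
# {'_id': 1, 'name': 'Acme', 'address': '{"city": "Boston", "zip": "02101"}'}
```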
Use cases (CSV)
- Extract data from MongoDB and load it into a relational database.
- Extract data from MongoDB and load it into a data warehouse such as Snowflake, Amazon Redshift, or Azure Synapse Analytics.
Process (CSV)
Step 1. Create a MongoDB Connection.
Step 2. Create a CSV Format. The default settings for the CSV Format will work.
Step 3. Create a Flow where the source is a file, and the destination is a database.
Step 4. Select the MongoDB Connection created in Step 1 and the CSV Format created in Step 2 as the source (FROM).
Step 5. Enter or select the MongoDB collection name into the FROM field.
Note: This Flow supports extracting by wildcard, for example, customer*. When the FROM is configured with a wildcard, you can use a wildcard in the TO as well.
Step 6. Select a destination Connection and Format (if needed), and enter the destination into the TO field.
Step 7. Continue by adding Mapping and transformations. Read how to use SQL with MongoDB.
Extract data from a collection using JSON format
If you prefer to keep the nested structure when extracting the data from the MongoDB document, use JSON format.
Use cases (JSON)
- Extract data from MongoDB and create JSON files.
- Extract data from MongoDB and load it into a message queue such as Kafka or Kinesis.
Process (JSON)
Step 1. Create a MongoDB Connection.
Step 2. Create a JSON Format. The default settings for the JSON Format will work.
Step 3. Create a Flow where the source is a file, and the destination is anything.
Step 4. Select the MongoDB Connection created in Step 1 and the JSON Format created in Step 2 as the source (FROM).
Note: This Flow supports extracting by wildcard, for example, customer*. When the FROM is configured with a wildcard, you can use a wildcard in the TO as well.
Step 5. Enter or select the MongoDB collection name into the FROM field.
Step 6. Select a destination Connection and Format (if needed), and enter the destination into the TO field.
Step 7. Continue by adding Mapping and transformations. Read how to use SQL with MongoDB.
Extract data from a collection using a filter
Using filters is deprecated. Read how to use SQL with MongoDB.
When extracting data from a collection using either the CSV or JSON format, it is possible to set a filter. When a filter is set, the Flow filters and optionally transforms the documents retrieved from the collection.
Process
Follow the same steps as described above for the CSV and JSON formats.
When setting the FROM field (Step 5), enter the filter in the following format: collection.{filter}.
Example of a filter: {"STORE_ID":1} extracts all documents where STORE_ID = 1.
Read about the MongoDB query language.
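The filter uses the standard MongoDB query syntax. As an illustration of what such a filter selects, here is the equivalent query in Python with pymongo (connection string, database, and collection names are placeholders):

```python
from pymongo import MongoClient

# Placeholder connection string and database name
client = MongoClient("mongodb://localhost:27017")
db = client["mydb"]

# Equivalent of configuring FROM as customers.{"STORE_ID":1}:
# select all documents in the customers collection where STORE_ID = 1
for doc in db["customers"].find({"STORE_ID": 1}):
    print(doc)
```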
Extract data from a specific document or a group of documents
Step 1. Create a MongoDB document Connection.
Step 2. Create a JSON Format. The default settings for the JSON Format will work.
Step 3. Create a Flow where the source is a file, and the destination is anything.
Step 4. Select the MongoDB document Connection created in Step 1 and the JSON Format created in Step 2 as the source (FROM).
Step 5. Enter or select the source into the FROM field.
Step 6. Select a destination Connection and Format (if needed), and enter the destination into the TO field.
Step 7. Click Mapping and enter SQL in Source query. Read how to use SQL with MongoDB.
Capture changes in MongoDB documents and collections
The Etlworks MongoDB CDC connector can monitor a MongoDB replica set or a MongoDB sharded cluster for document changes in databases and collections and send those changes as events to the destination.
Read how to set up a Flow to capture changes in MongoDB documents and collections.
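MongoDB CDC is typically built on change streams, which is why a replica set or sharded cluster is required. A minimal Python sketch of the underlying mechanism (an illustration, not the Etlworks connector itself; the connection string and names are placeholders):

```python
from pymongo import MongoClient

# Change streams require a replica set or sharded cluster,
# which is why the CDC connector has the same requirement.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
collection = client["mydb"]["customers"]

# Each change (insert, update, replace, delete) arrives as an event
# that a CDC pipeline would forward to the destination.
with collection.watch() as stream:
    for change in stream:
        print(change["operationType"], change["documentKey"])
```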
Load data in MongoDB
Load data in MongoDB and create a separate document for each source record
Use it if you want to store each record of the source dataset as a separate MongoDB document.
Step 1. Create a MongoDB Connection.
When configuring the Connection, set the parameter What to do with the existing document when writing to one of the following options (a sketch illustrating these semantics follows the list):
- replace: replace the existing document with all the fields.
- update: update only the fields in the existing document passed in the payload and leave other fields unchanged.
- insert: add new documents without checking for existing records. This is the fastest option because MongoDB does not need to perform any lookup operations, but it should only be used for initial data loads, where there is no risk of inserting duplicate documents.
Step 2. Create a JSON Format. If the source dataset to store in MongoDB is an ARRAY (for example, a database table, CSV file, or JSON array), set the First Object Type to variable and the Root Object Name to the name you want to use for the root element. This field is required.
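As an illustration of the Root Object Name setting, if the root name is set to data (a placeholder name), the serialized payload wraps the source array under that root element, roughly like this sketch:

```python
import json

rows = [{"CUSTOMER_ID": 1}, {"CUSTOMER_ID": 2}]  # source ARRAY (e.g., a table)

# The Root Object Name ("data" here) becomes the root element that
# wraps the source array in the generated JSON.
payload = {"data": rows}
print(json.dumps(payload))
# {"data": [{"CUSTOMER_ID": 1}, {"CUSTOMER_ID": 2}]}
```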
Step 3. Create a Flow where the destination is a file, and the source is anything.
Step 4. Select the source (FROM) Connection and Format. Enter the source object name.
Step 5. Select the MongoDB Connection created in Step 1 and the JSON Format created in Step 2 as the destination (TO).
Step 6. Enter or select the MongoDB collection name into the TO field.
Optionally, you can specify the field that will be used to generate the unique value of the MongoDB object id (the _id field).
Format: collection_name.unique_field_name
Example: customers.CUSTOMER_ID
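Conceptually, the customers.CUSTOMER_ID setting means the value of CUSTOMER_ID becomes the document's _id, so repeated loads of the same record target the same document instead of creating a duplicate. A hedged pymongo sketch of that idea (not Etlworks internals; names are placeholders):

```python
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mydb"]["customers"]
record = {"CUSTOMER_ID": 15, "NAME": "Acme"}

# Use the unique source field as the MongoDB object id (_id) so that
# reloading the same record updates the same document.
doc = dict(record)
doc["_id"] = doc["CUSTOMER_ID"]
collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```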
Step 7. Continue by adding Mapping and transformations.
Load data in MongoDB as a single document
Use it if you want to store the entire dataset (for example, the entire Excel worksheet) as a single MongoDB document.
Step 1. Create a MongoDB document Connection.
Step 2. Create a JSON Format. If the source dataset to store in MongoDB is an ARRAY (for example, a database table, CSV file, or JSON array), set the First Object Type to variable and the Root Object Name to the name you want to use for the root element. This field is required.
Step 3. Create a Flow where the destination is a file, and the source is anything.
Step 4. Select the source (FROM) Connection and Format. Enter the source object name.
Step 5. Select the MongoDB document Connection created in Step 1 and the JSON Format created in Step 2 as the destination (TO).
Step 6. Enter or select the MongoDB object id into the TO field.
Step 7. Continue by adding Mapping and transformations.
Improving Performance When Loading Documents into MongoDB
Several settings can be adjusted to make loading data into MongoDB more efficient, particularly when handling large datasets. Key parameters such as Batch Size, Write Concern, and the action defined in What to do with the existing document when writing can significantly impact the performance of your data load operations.
Key Parameters for Performance Optimization:
Batch Size: Adjusting the batch size controls how many documents are written in a single operation. A larger batch size can improve performance by reducing the number of network roundtrips between the ETL process and MongoDB, which is particularly beneficial for large datasets because it minimizes the overhead of many small write operations.
Write Concern: This parameter defines the level of acknowledgment required from MongoDB for write operations. For instance, you can configure the write concern to require only minimal acknowledgment for improved speed, particularly during initial data loads, when stricter guarantees on data durability are less critical.
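A minimal pymongo sketch of both ideas, assuming a placeholder connection and collection (batching via insert_many and a relaxed write concern of w=0; this illustrates the MongoDB-side behavior, not Etlworks configuration):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")

# w=0 ("fire and forget") trades durability guarantees for speed;
# reasonable for an initial load, risky for incremental updates.
fast = client["mydb"].get_collection(
    "customers", write_concern=WriteConcern(w=0)
)

rows = [{"CUSTOMER_ID": i} for i in range(10_000)]

# One insert_many call per batch: far fewer network roundtrips
# than inserting documents one at a time.
batch_size = 1_000
for start in range(0, len(rows), batch_size):
    fast.insert_many(rows[start:start + batch_size], ordered=False)
```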
“What to Do with the Existing Document When Writing” Options
The What to do with the existing document when writing setting provides three options for managing existing documents in MongoDB, each with different performance characteristics:
- Insert: This option adds new documents without checking for existing records. It offers the fastest performance because MongoDB does not need to perform any lookup operations. However, it should only be used for initial data loads, where there is no risk of inserting duplicate documents.
- Update: This option checks for existing documents and updates them if they are found. While slower than Insert, it ensures that records are updated correctly and avoids creating duplicates.
- Replace: This option replaces the entire document if an existing one is found. It is the slowest option of the three but ensures that no old data remains.
Using “Insert” for Optimal Performance
For initial data loads, selecting Insert is the most efficient option, as it bypasses checks for existing records. This approach significantly speeds up the process because MongoDB can directly insert the data without any additional logic for matching or replacing documents. However, after the initial load, switching to Update or Replace ensures that future data loads handle duplicates and updates properly.
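A hedged sketch of this pattern with pymongo (placeholder names): plain inserts for the first load, then bulk upserts for subsequent loads so duplicates and updates are handled correctly:

```python
from pymongo import MongoClient, ReplaceOne

collection = MongoClient("mongodb://localhost:27017")["mydb"]["customers"]
rows = [{"_id": i, "NAME": f"Customer {i}"} for i in range(1000)]

# Initial load: plain inserts, no existence checks (fastest path).
collection.insert_many(rows, ordered=False)

# Subsequent loads: upserts so changed records replace the old ones
# and re-delivered records do not create duplicates.
ops = [ReplaceOne({"_id": r["_id"]}, r, upsert=True) for r in rows]
collection.bulk_write(ops, ordered=False)
```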
Conclusion
By adjusting Batch Size, configuring Write Concern, and using Insert for the initial load, you can drastically improve the performance of data loading operations into MongoDB.