Available in: Starter, Business, Enterprise, On-Premise, Add-on
Overview
This tutorial demonstrates how to implement near real-time, CDC-based change replication for the most popular databases using the following technologies:
- Native CDC for each source database.
- Apache Kafka.
- Debezium.
- Etlworks Kafka or Azure Event Hubs connector with built-in support for Debezium.
- Etlworks streaming flow that supports any of the destinations listed in Sources and destinations below.
In this article, we assume that Debezium and Kafka run outside of the Etlworks infrastructure. If you are looking for an end-to-end CDC solution that does not rely on third-party services such as Kafka, Kafka Connect, and Debezium, we recommend Etlworks' built-in CDC connectors. Read more. Use this tutorial if you are already ingesting CDC events into Kafka using standalone Debezium and want to stream them from Kafka or Azure Event Hubs into any destination using the advanced capabilities of Etlworks connectors.
How it works
Etlworks streams CDC events stored in Kafka, automatically transforms them into DML SQL statements (INSERT / UPDATE / DELETE), and executes the statements in the target database in the order they were created. It also supports bulk load into columnar databases. The Etlworks streaming flow handles collisions and errors, making the replication reliable. It is more advanced and supports more destinations than the Kafka Connect sink connectors.
Read about other change replication techniques available in Etlworks.
Sources and destinations
The following databases are supported as a source:
- Microsoft SQL Server
- MongoDB
- MySQL
- Oracle
- IBM DB2
- PostgreSQL
Supported destinations:
- File storage.
- Relational database.
- Database that supports bulk load.
- Snowflake.
- Amazon Redshift.
- Google BigQuery.
- Azure Synapse Analytics.
- Greenplum.
- Vertica.
Prerequisites
The solution requires installing and configuring Apache Kafka, Zookeeper, Kafka Connect, and Debezium.
This official tutorial from Debezium explains how to install and configure all required components.
Quick start
We recommend using Docker to install, configure, and run Debezium. Docker must be installed and running on the host OS.
docker-compose file
Here is an example of a docker-compose file that can be used to quickly install and start the latest versions of Kafka, Zookeeper, Kafka Connect with Debezium, Kafka UI, and Debezium UI on a development box.
version: '2'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:2.1.2.Final
    ports:
      - 2181:2181
      - 2888:2888
      - 3888:3888
  kafka:
    image: quay.io/debezium/kafka:2.1.2.Final
    ports:
      - 9092:9092
    links:
      - zookeeper
    environment:
      - ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
      - ADVERTISED_HOST_NAME=etlworks
  connect:
    image: quay.io/debezium/connect:nightly
    ports:
      - 8083:8083
    links:
      - kafka
    environment:
      - BOOTSTRAP_SERVERS=kafka:9092
      - SCHEMA_HISTORY_INTERNAL_KAFKA_BOOTSTRAP_SERVERS=kafka:9092
      - GROUP_ID=1
      - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
      - CONFIG_STORAGE_TOPIC=connect_configs
      - OFFSET_STORAGE_TOPIC=connect_offsets
      - STATUS_STORAGE_TOPIC=connect_statuses
  debezium-ui:
    image: quay.io/debezium/debezium-ui:2.2
    ports:
      - 8090:8080
    links:
      - connect
    environment:
      - KAFKA_CONNECT_URIS=http://connect:8083
  kafka-ui:
    image: provectuslabs/kafka-ui
    links:
      - kafka
    ports:
      - 8888:8080
    restart: always
    environment:
      - KAFKA_CLUSTERS_0_NAME=local
      - KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS=kafka:9092
How to start
Step 1. Create docker-compose.yaml using the example above in any folder.
Step 2. cd to that folder.
Step 3. Run docker-compose up.
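Put together, the steps above look roughly like the sketch below. The folder name debezium-stack is just an illustration; any folder works.
# create a working folder and save the docker-compose.yaml from the example above into it
mkdir debezium-stack && cd debezium-stack
# start Zookeeper, Kafka, Kafka Connect with Debezium, Debezium UI, and Kafka UI
docker-compose up
# add -d to run the containers in the background instead of streaming their logs to the console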
Docker Desktop
If you are running Docker on Windows or Mac, most likely you are using Docker Desktop. Here is a list of containers provisioned after executing docker-compose up for the docker-compose.yaml above.
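If you prefer the command line to Docker Desktop, the same containers can be listed with docker-compose or docker. A minimal check, run from the folder containing docker-compose.yaml:
docker-compose ps
# or list all running containers with their ports
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Ports}}"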
Configure Debezium to capture CDC events
REST API
Assuming that Debezium is up and running, we will configure a connector to the source database using the Kafka Connect REST API. In this tutorial, we will be using Microsoft SQL Server. Configuring connectors to other databases is equally simple.
Official tutorials
The Debezium connectors are created using the Kafka Connect REST API, so make sure either curl or Postman is installed on your development box. In this tutorial, we will be using Postman.
Step 1. Verify that Kafka Connect is installed and running.
The default port for the Kafka Connect API is 8083. Assuming that it runs on localhost, the URL of the API endpoint that returns the configured connectors is http://localhost:8083/connectors.
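For example, the same check can be done with curl (the command assumes Kafka Connect is reachable on localhost:8083):
curl -H "Accept: application/json" http://localhost:8083/connectors
# a fresh installation returns an empty list: []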
Step 2. Create a new connector for Microsoft SQL Server.
ENDPOINT URL: http://localhost:8083/connectors
METHOD: POST
PAYLOAD (example):
{
"name": "sqlserver-connector",
"config": {
"connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max": "1",
"database.hostname": "localhost",
"database.port": "1433",
"database.user": "sa",
"database.password": "password",
"database.dbname": "database_name",
"database.server.name": "database_server_name",
"table.whitelist": "comma separated list of fully_qualified_table_names",
"database.history.kafka.bootstrap.servers": "localhost:9092",
"database.history.kafka.topic": "dbhistory.database_server_name.database_name",
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable": "true",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter.schemas.enable": "true",
"include.schema.changes": "true"
}
}
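If you prefer curl to Postman, the same request can be submitted from the command line. The sketch below assumes the payload above is saved as sqlserver-connector.json (a hypothetical file name) in the current folder:
# create the connector
curl -X POST -H "Content-Type: application/json" --data @sqlserver-connector.json http://localhost:8083/connectors
# verify that the connector was created and its task is running
curl http://localhost:8083/connectors/sqlserver-connector/status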
Debezium UI
An alternative to the REST API for creating Debezium connectors is the Debezium UI. Here is an official tutorial for Debezium UI from the creators of Debezium.
Note that the Debezium UI is installed when using the docker-compose.yaml above.
To open Debezium UI, navigate to http://localhost:8090/ in the browser.
The naming convention for Kafka topics
Debezium stores CDC events in a separate topic for each table. If the connector was configured using the following parameters:
"database.dbname": "database_name",
"database.server.name": "database_server_name",
The CDC events for the table customer will be stored in the Kafka topic database_server_name.database_name.customer.
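To confirm that the topics were created, open Kafka UI (http://localhost:8888/ when using the docker-compose.yaml above) or list the topics from inside the Kafka container. The container name and the path to kafka-topics.sh below are assumptions and may differ in your setup:
docker exec -it <kafka-container-name> bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
# the output should include one topic per captured table, for example database_server_name.database_name.customer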
Create a Flow to stream CDC events created by Debezium into any destination
Step 1. Create and configure source connection for message queue as explained here.
Step 2. When configuring the source connection, make sure Integration with CDC provider is set to Debezium.
Step 3. Create a streaming flow for any of the following destinations:
- File storage.
- Relational database.
- Database that supports bulk load.
- Snowflake.
- Amazon Redshift.
- Google BigQuery.
- Azure Synapse Analytics.
- Greenplum.
- Vertica.
Step 4. Configure TO (destination).
The TO depends on the type of destination.
File storage
The TO is a destination filename.
The TO in the streaming transformation can include any of the [tokens] below:
- [table] - the source table name.
- [db] - the source database name.
- [schema] - the source schema.
- * - the source topic name.
The flow substitutes each [token] with its corresponding value.
This example demonstrates how to configure a transformation that creates a file whose name includes the source database and table names: [db]_[table].csv. For example, for the topic database_server_name.database_name.customer, the file name resolves to database_name_customer.csv.
Read how to set the destination name using JavaScript.
Relational database
The TO is a destination table name.
The TO in the streaming transformation can include any of the [tokens] below:
- [table] - the source table name.
- [db] - the source database name.
- [schema] - the source schema.
- * - the source topic name.
The flow substitutes each [token] with its corresponding value.
This example demonstrates how to configure a transformation that loads data into a table with the same name as the source table: public.[table].
Read how to set the destination name using JavaScript.
Database that supports bulk load
The TO is a destination table name.
It is recommended to set TO as db.schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Snowflake
The TO is the destination table name.
It is recommended to set TO as schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Amazon Redshift
The TO is the destination table name.
It is recommended to set TO as schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Google BigQuery
The TO is the destination table name.
It is recommended to set TO as project.schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Azure Synapse Analytics
The TO is the destination table name.
It is recommended to set TO as db.schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Greenplum
The TO is the destination table name.
It is recommended to set TO as schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Vertica
The TO is the destination table name.
It is recommended to set TO as schema.*. In this case, the flow will load data into the table with the same (or modified) name as the source table or topic. Read how to set the destination name using JavaScript.
Step 5. Optionally, if you wish to MERGE (UPSERT) the data into the destination table, set Action to CDC MERGE when configuring the streaming flow. With this action, the flow will INSERT records that do not exist in the destination table, UPDATE existing records, and DELETE records that were deleted in the source table.
Schedule the Flow
We recommend using the continuous run Schedule type. The idea is that the streaming Flow runs continuously until there are no more messages to extract or an error occurs. When the Flow stops for either of these reasons, it restarts automatically after the configured number of seconds.