When to use change data capture connectors
Change data capture (CDC) is an approach to data integration based on identifying, capturing, and delivering the changes made to the source database and stored in the database redo log (also called the transaction log).
Etlworks supports native log-based change data capture for PostgreSQL, SQL Server, MySQL, Oracle, DB2, and MongoDB.
We use a heavily modified embedded Debezium engine for CDC.
Read about configuring the CDC pipeline in Etlworks.
Enabling CDC for the source database
Microsoft SQL Server
Read how to enable CDC for Microsoft SQL Server.
Read how the CDC connector for SQL Server works.
MySQL
Read how to enable CDC for MySQL.
Read how the CDC connector for MySQL works.
Oracle
Read how to enable CDC for Oracle.
Read how the CDC connector for Oracle works.
PostgreSQL
Read how to enable CDC for PostgreSQL.
Read how the CDC connector for PostgreSQL works.
DB2
Read how to enable CDC for DB2.
Read how the CDC connector for DB2 works.
MongoDB
Read about supported MongoDB topologies.
Read how to enable CDC for MongoDB.
Read how the CDC connector for MongoDB works.
Creating a connection
Step 1. In the Connections window, click the + button and type in cdc.
Step 2. Select the source database.
Step 3. Enter connection parameters.
Step 4. Test the connection.
Connection parameters
All databases
- Name - a required unique name of the connection. Attempting to create another connection with the same name will fail. The value of this field is used as a part of the offset file name, which tracks the latest CDC offset.
- Include Table(s) - a comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored; any table not included in the whitelist will be excluded from monitoring. Each identifier is of the form databaseName.tableName. By default, the connector will monitor every non-system table in each monitored database. This parameter will be overridden in the TO field of the CDC flow.
- Excluded Column(s) - an optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName.
- CDC Key - a template for the CDC key. The CDC key is used as a file name when streaming CDC events to files and as a topic name when streaming CDC events to a message queue, such as Kafka. The CDC key template is different for each database (see the example after this list).
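For illustration, here is a minimal sketch of how these parameters might be filled in for a hypothetical source database named inventory. The object keys and all values below are ours, not the product's labels:

```javascript
// Hypothetical "All databases" parameter values (names are illustrative):
const connection = {
  name: "inventory-cdc",                             // also used as part of the offset file name
  includeTables: "inventory\\.orders,inventory\\.customer_.*", // regexes of databaseName.tableName
  excludedColumns: "inventory\\.customer_.*\\.ssn",  // drop a sensitive column from change events
  cdcKey: "[db]_[table]_cdc_stream"                  // expands to e.g. inventory_orders_cdc_stream
};
```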
Authentication
- User - the user name
- Password - the password
Serialization
The parameters in this group are used to configure the format and the fields of the CDC stream.
- Add Event Type and Unique Sequence - if this option is enabled, the connector adds two extra columns to the CDC stream:
- debezium_cdc_op - the event type, VARCHAR(1): "c" for create, "u" for update, and "d" for delete.
- debezium_cdc_timestamp - the unique sequential number of the CDC event, LONG.
- This parameter must be enabled when loading data into an online data warehouse, such as Amazon Redshift or Snowflake.
- Extra columns - the columns specified in this field will be added to the end of the stream. Read more about extra columns. Extra columns can also be used to implement a soft delete.
- Flatten CDC events - if this option is enabled, the CDC connector will flatten CDC events encoded using the Extended JSON format. Only the following data types are decoded: $oid, $symbol, $numberInt, $numberLong, $numberDouble, $numberDecimal, $code, $date, $minKey, $maxKey, $undefined. It will also transform [arrays] to strings.
- Convert JSON array to comma-separated list - if this option is enabled, the CDC connector converts JSON arrays to a comma-separated list of strings: [1,2,3] -> 1,2,3 (see the sketch after this list).
- Serialize CDC events as CSV files - if this option is enabled, the CDC connector saves the CDC events as CSV files in the local (server) or cloud storage. The filename for each generated file will be [cdc_key]_uuid.csv, where uuid is an automatically generated globally unique identifier.
- If this option is enabled, the connector creates an XML file with the same name as the CSV file. The XML file contains information about the actual data types of the columns and which columns belong to the primary key. If present, this information will be used during the load to set the data types of the columns in the destination. Read more about preserving the exact data types of the columns in the source table when creating a destination table.
- Columns in CSV File - the columns specified in this field will be used to create the CSV file. Other columns, except 'extra columns' and debezium_cdc_timestamp/debezium_cdc_op, will be ignored. This field is case insensitive: you can enter columns in any case and the system will match them regardless.
- Serialize CDC events as JSON files - if this option is enabled, the CDC connector saves the CDC events as JSON files in the local (server) or cloud storage. The filename for each generated file will be [cdc_key]_uuid.json, where uuid is an automatically generated globally unique identifier. This option can be enabled together with Serialize CDC events as CSV files.
- Max Rows in CSV file - the maximum number of rows in a CSV file. When the file reaches the maximum number of rows, the connector closes the file.
- Keep file open (ms) - the maximum wait time in milliseconds before closing the CSV file if it has not been updated. The connector waits the configured number of milliseconds and, if there are no new updates for the specific database.table, closes the file.
- Close CSV file if header is different - if this option is enabled, the CDC connector will close the CSV file and create a new one if the new header is different from the current header for the file with the same CDC key. Enable it if you expect that the source table structure can change, or when it is possible to have a different number of columns in the CDC stream for a specific object (typically a MongoDB collection).
- Close File After (ms) - the wait time in milliseconds before closing the CSV file, even if it is still being updated or has not reached the maximum number of rows.
- Escape double-quotes - if this option is enabled, a double quote appearing inside a field will be escaped by preceding it with another double quote.
- Convert null to empty - if this option is enabled, the connector will convert SQL NULL values to an empty string when creating the CSV file. Read more about handling SQL NULLs and empty strings.
- Remove EOL characters - if this option is enabled, the system will remove end-of-line (EOL) characters from the field's values when creating CSV files.
- Preprocessor - optional JavaScript code that can be used to change the CDC event before saving it to the CSV file (see the sketch after this list). The following variables are available in the preprocessor code:
- event - the CDC event serialized as com.fasterxml.jackson.databind.JsonNode. You can change the event using the methods of the JsonNode class.
- key - the CDC key. This variable is immutable.
- Encoding - the character encoding of the generated CSV files.
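To illustrate Flatten CDC events and Convert JSON array to comma-separated list, here is a hedged before/after sketch; the field names and values are made up:

```javascript
// Before: an event encoded with Extended JSON types and an array field.
const raw = { _id: { $oid: "54759eb3c090d83494e2d804" }, qty: { $numberInt: "5" }, tags: ["a", "b", "c"] };

// After flattening and array conversion (a sketch of the expected shape):
const flattened = { _id: "54759eb3c090d83494e2d804", qty: 5, tags: "a,b,c" };
```

The Preprocessor referenced above runs JavaScript against each event. A minimal sketch, assuming the event behaves like a Jackson ObjectNode (put and remove are ObjectNode methods) and assuming hypothetical column names email and debug_info:

```javascript
// Mask a sensitive column and drop a noisy one before the event is written.
// "email" and "debug_info" are hypothetical column names.
if (event.has("email")) {
    event.put("email", "***masked***");
}
event.remove("debug_info");
// "key" (the CDC key) can be read here, e.g. for logging, but it is immutable.
```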
Storage
The parameters in this group are used to configure where to store the generated CSV and JSON files.
- Storage type - CDC events serialized as files can be stored in the attached File system (default) or in any of the supported cloud storage systems:
- Amazon S3
- Azure Storage
- Google Storage
- Archive files - if an archive format is selected, the generated files will be archived using the selected format before being copied to the cloud storage. This parameter is ignored if File system is selected as the storage type.
- Location of the CDC events - the location of the CDC events serialized as files. If this field is empty the system will use the default location, which is {app.data}/debezium_data/events. When configuring the location you can use tokens, for example {app.data} as a part of the field's value. NOTE: This field is required when cloud storage is selected:
- Amazon S3 - bucket/[optional folders]
- Azure Storage - container/[optional folders]
- Google Storage - bucket/[optional folders]
- Key - the key (username) for the cloud storage account:
- Amazon S3 - Access Key ID
- Azure Storage - Storage account
- Google Storage - Access Key
- Secret - the secret (password) for the cloud storage account:
- Amazon S3 - Secret Access Key
- Azure Storage - Access Key
- Google Storage - Secret
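A hedged example of the Storage parameters for Amazon S3; the bucket, folder, and credential values are placeholders, and the object keys are ours:

```javascript
const storage = {
  storageType: "Amazon S3",
  archiveFiles: "zip",                        // assumption: an archive format selected in the UI
  location: "my-cdc-bucket/debezium/events",  // bucket/[optional folders]
  key: "AKIAXXXXXXXXXXXXXXXX",                // Access Key ID (placeholder)
  secret: "****************"                  // Secret Access Key (placeholder)
};
```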
Offset parameters
An active CDC connection records the latest offset in the transaction log to the file {app.data}/debezium_data/offset_file_name.dat. You can override the default file name, which is calculated using the unique name of the connection.
- Offset Commit Interval (ms) - the interval at which the connector commits offsets. The default is 1 minute.
- Offset Flush Timeout (ms) - the maximum number of milliseconds to wait for records to flush and partition offset data to be committed to offset storage before canceling the process and restoring the offset data to be committed in a future attempt. The default is 5 seconds.
- Offset File Name - the offset file name. If this field is empty the offset file name will be calculated automatically using the unique name of the connection. The offset file is located in {app.data}/debezium_data.
- DDL History File Name - the DDL history file name. If this field is empty the DDL file name will be calculated automatically using the unique name of the connection. The DDL history file is located in {app.data}/debezium_data.
CDC Stream
- Maximum Queue Size - a positive integer value that specifies the maximum size of the blocking queue into which change events read from the database log are placed before they are written to the stream. This queue can provide backpressure to the transactional log reader when, for example, writes to the stream are slower. Events that appear in the queue are not included in the offsets periodically recorded by this connector. Defaults to 8192, and should always be larger than the maximum batch size.
- Maximum Batch Size - a positive integer value that specifies the maximum size of each batch of events that should be processed during each iteration of this connector. Defaults to 2048.
- Poll Interval in Milliseconds - a positive integer value that specifies the number of milliseconds the connector should wait during each iteration for new change events to appear. Defaults to 1000 milliseconds, or 1 second.
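These three settings interact: the queue must always be larger than the batch. A sketch using the documented defaults (the object keys are ours):

```javascript
// Documented defaults; maxQueueSize should always exceed maxBatchSize.
const stream = {
  maxQueueSize: 8192,   // events buffered between the log reader and the stream writer
  maxBatchSize: 2048,   // events processed per iteration
  pollIntervalMs: 1000  // wait between iterations when no new events arrive
};
```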
Snapshot
- Snapshot Mode - specifies the criteria for running a snapshot upon startup of the connector. It is different for each database (see below).
- Snapshot Locking Mode - the optional locking mode for the specific database when performing the snapshot. It is different for each database (see below).
How to stop CDC Stream
The flow will stop when any of the thresholds below is reached.
- Number of retries before giving up - the number of retries before giving up if the poll returns no records. The default is 5.
- Retry N minutes before giving up - the number of minutes to retry before giving up if the poll returns no records.
- Always stop after N minutes - the number of minutes to run the connector before stopping. The connector will stop even if there are new records in the queue. Note that the connector will NOT stop during the snapshot.
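A sketch of a stop configuration combining these thresholds; the 60-minute value is illustrative, 5 retries is the documented default, and the object keys are ours:

```javascript
const stopConditions = {
  retriesBeforeGivingUp: 5,   // stop after 5 polls that return no records (default)
  alwaysStopAfterMinutes: 60  // hard stop after 60 minutes, even with records queued
};
```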
Other parameters
- Other Parameters - other parameters as key=value pairs. The parameters are specific for the different connectors.
- Connection Timeout in Milliseconds - a positive integer value that specifies the maximum time in milliseconds this connector should wait after trying to connect to the database server before timing out. Defaults to 30 seconds.
How to handle errors
When multiple connectors are configured to run on the same database (for example, each connector could be extracting data from a different group of tables), an error in one connector (for example, losing connectivity to the source database) can cause cascading errors in the other running connectors. These errors can cause a temporary spike in CPU/RAM utilization when multiple connectors try to recover from the error at the same time. In this case, it makes sense to shut down all the connectors running on the same database and let the flows restart automatically.
- Errors to trigger stop for all CDC flows - a semicolon-separated list of errors which will trigger a stop for all CDC connectors running for the database with the same host:port. The list can contain the entire error message or a part of it. Example: Failed to deserialize data;EOFException - to trigger the stop, the exception stack trace must contain 'Failed to deserialize data' AND 'EOFException'.
Formats and Templates
- Template - use a template to populate columns for mapping without actually connecting to the source database. Example: {"first":"string","last":"string"}
- Date and Time Format - the format for fields with a TIMESTAMP type. If not entered, the TIMESTAMP fields will be serialized as Unix epoch time (in milliseconds).
- Date Format - the format for fields with a DATE type. If not entered, the DATE fields will be serialized as Unix epoch time (in milliseconds).
- Time Format - the format for fields with a TIME type. If not entered, the TIME fields will be serialized as Unix epoch time (in milliseconds).
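For example, with patterns like the ones below (illustrative Java-style patterns, not mandated by the product), temporal fields are serialized as formatted strings; left blank, they are serialized as Unix epoch milliseconds:

```javascript
const formats = {
  dateAndTimeFormat: "yyyy-MM-dd HH:mm:ss.SSS", // TIMESTAMP fields
  dateFormat: "yyyy-MM-dd",                     // DATE fields
  timeFormat: "HH:mm:ss"                        // TIME fields
};
```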
Connecting to a database over SSH tunnel
If you need to access a database that can only be accessed via an SSH tunnel, you need to specify additional information in the "Connect over SSH tunnel" section of the database connection screen.
- SSH Host - the name or IP address for the host accepting SSH connections.
- SSH Port - the port accepting SSH connections. The default value is 22.
- SSH User - the user name
- SSH Password - the optional password
- Private Key File - the private key file in the pem or ppk format used for SSH authentication. Click the "Upload/Select ssh key" button to manage SSH keys using the GUI. You can also upload the private key file manually and use the token {app.data} as a part of the filename, for example, {app.data}/keys/secret.pem. This parameter is optional.
Database specific parameters
SQL Server parameters
SQL Server CDC connector parameters
- Host - the host or IP address of the source database.
- Port - the port the source database is listening to.
- Database - the name of the database to connect to on startup.
- Include Table(s) - a comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored; any table not included in the whitelist will be excluded from monitoring. Each identifier is of the form databaseName.tableName. By default, the connector will monitor every non-system table in each monitored database. This parameter will be overridden in the TO field of the CDC flow.
- Excluded Column(s) - an optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName.
- Server - logical name that identifies and provides a namespace for the particular SQL Server database server being monitored. The logical name should be unique across all other connectors.
- CDC Key - a template for the CDC key. The CDC key is used as a file name when streaming CDC events to files and as a topic name when streaming CDC events to a message queue such as Kafka. Example of the CDC key: testDB_dbo_customers_cdc_stream (see the expansion sketch after this list). Available tokens:
- [db] - database name,
- [schema] - schema name,
- [table] - table name.
- [op] - the CDC operation: c (create), d (delete), u (update)
- [timestamp] - the timestamp of the event in ms
- [version] - the connector version
- [connector] - the connector class name
- [name] - the connector name
- [ts_ms] - the timestamp of the event in ms
- [snapshot] - true for a snapshot
- [change_lsn] - the change LSN
- [commit_lsn] - the commit LSN
- [event_serial_no] - event serial number
- JDBC Fetch Size - the JDBC Fetch Size field can be used to set the maximum number of records which will be retrieved in one database call when fetching the data in a snapshot mode, thus limiting the memory consumption within the JVM. The default is 10000.
- Max Rows to Read - the maximum number of rows to read when fetching the data in a snapshot mode. The default is empty, hence no limit.
- Max Field Size - sets the limit for the maximum number of bytes that can be returned for character and binary column values in a ResultSet object produced by this Statement object. The default is empty, hence there is no limit.
- Snapshot Mode - a mode for taking an initial snapshot of the structure and optional data of captured tables. Supported values are:
- initial (will take a snapshot of structure and data of captured tables; useful if topics should be populated with a complete representation of the data from the captured tables)
- initial_schema_only (will take a snapshot of the structure of captured tables only; useful if only changes happening from now onwards should be propagated to topics). Once the snapshot is complete, the connector will continue reading change events from the database’s redo logs.
- Snapshot Isolation Mode - controls which transaction isolation level is used and how long the connector locks the monitored tables. There are five possible values: read_uncommitted, read_committed, repeatable_read, snapshot, and exclusive (exclusive mode uses the repeatable read isolation level but takes an exclusive lock on all tables to be read). It is worth noting that the snapshot, read_committed, and read_uncommitted modes do not prevent other transactions from updating table rows during the initial snapshot, while exclusive and repeatable_read do. Another aspect is data consistency. Only the exclusive and snapshot modes guarantee full consistency, that is, the initial snapshot and the streaming logs constitute a linear history. With repeatable_read and read_committed, it might happen that, for instance, an added record appears twice: once in the initial snapshot and once in the streaming phase. Nonetheless, that consistency level is sufficient for data mirroring. With read_uncommitted there are no data consistency guarantees at all (some data might be lost or corrupted).
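As referenced in the CDC Key description above, here is a minimal sketch of how a template like [db]_[schema]_[table]_cdc_stream might expand. This is not the real engine (which supports all of the tokens listed above); it covers only three tokens for illustration:

```javascript
// Expand a CDC key template for a SQL Server event (illustrative sketch).
function expandCdcKey(template, event) {
  return template
    .replace("[db]", event.db)
    .replace("[schema]", event.schema)
    .replace("[table]", event.table);
}

// expandCdcKey("[db]_[schema]_[table]_cdc_stream",
//              { db: "testDB", schema: "dbo", table: "customers" })
// returns "testDB_dbo_customers_cdc_stream", matching the example above.
```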
Debezium SQL Server properties
Read about all properties available for Debezium SQL Server connector. The properties can be set as key-value pairs in Other Parameters.
MySQL parameters
MySQL CDC connector parameters
- Host - the host or IP address of the source database.
- Port - the port the source database is listening to.
- Database - the name of the database to connect to on startup.
- Include Database(s) - an optional comma-separated list of regular expressions that match database names to be monitored; any database name not included in the whitelist will be excluded from monitoring. By default, all databases will be monitored.
- Include Table(s) - a comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored; any table not included in the whitelist will be excluded from monitoring. Each identifier is of the form databaseName.tableName. By default, the connector will monitor every non-system table in each monitored database. This parameter will be overridden in the TO field of the CDC flow.
- Excluded Column(s) - an optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName.
- Server - logical name that identifies and provides a namespace for the particular MySQL database server/cluster being monitored. The logical name should be unique across all other connectors. Defaults to host:port, though we recommend using an explicit and meaningful logical name.
- Case sensitive matching - by default, the system uses a case-insensitive pattern to whitelist the db.table to stream. For example, if there is a database 'test' and the tables 'abc' and 'Abc', the stream will include records from 'test.abc' and 'test.Abc'. By enabling this option, you can enforce case-sensitive matching of the db.table defined for the source.
- CDC Key - a template for CDC key. The CDC key is used as a file name when streaming CDC events and creating files and as a topic name when streaming CDC events to a message queue such as Kafka. Example of the CDC key: inventory_customers_cdc_stream. Available tokens:
- [db] - database name,
- [table] - table name.
- [op] - the CDC operation: c (create), d (delete), u (update)
- [timestamp] - the timestamp of the event in ms
- [version] - the connector version
- [connector] - the connector class name
- [name] - the connector name
- [ts_ms] - the timestamp of the event in ms
- [snapshot] - true for a snapshot
- [server_id] - the MySQL server ID
- [gtid] - the GTID
- [file] - the binlog filename
- [pos] - the position in the binlog
- [row] - the row number
- JDBC Fetch Size - the JDBC Fetch Size field can be used to set the maximum number of records which will be retrieved in one database call when fetching the data in a snapshot mode, thus limiting the memory consumption within the JVM. The default is 10000.
- Max Rows to Read - the maximum number of rows to read when fetching the data in a snapshot mode. The default is empty, hence no limit.
- Max Field Size - sets the limit for the maximum number of bytes that can be returned for character and binary column values in a ResultSet object produced by this Statement object. The default is empty, hence no limit.
- Snapshot Mode - specifies the criteria for running a snapshot upon startup of the connector.
- The default is initial, and specifies that the connector runs a snapshot only when no offsets have been recorded for the logical server name.
- The when_needed option specifies that the connector runs a snapshot upon startup whenever it deems it necessary (when no offsets are available, or when a previously recorded offset specifies a binlog location or GTID that is not available in the server).
- The never option specifies that the connector should never use snapshots and that upon first startup with a logical server name the connector should read from the beginning of the binlog. This should be used with care, as it is only valid when the binlog is guaranteed to contain the entire history of the database.
- If you don't need the topics to contain a consistent snapshot of the data but only need them to have the changes since the connector was started, you can use the schema_only option, where the connector snapshots only the schemas (not the data).
- schema_only_recovery is a recovery option for an existing connector: use it to recover a corrupted or lost database history topic, or to periodically clean up a database history topic (which requires infinite retention) that may be growing unexpectedly.
- We recommend setting the snapshot mode to when_needed.
- Snapshot Locking Mode - controls if and how long the connector holds onto the global MySQL read lock (preventing any updates to the database) while it is performing a snapshot. There are three possible values: minimal, extended, and none.
- minimal - the connector holds the global read lock for just the initial portion of the snapshot while it reads the database schemas and other metadata. The remaining work in a snapshot involves selecting all rows from each table, and this can be done in a consistent fashion using a REPEATABLE READ transaction even when the global read lock is no longer held and while other MySQL clients are updating the database.
- extended - in some cases where clients are submitting operations that MySQL excludes from REPEATABLE READ semantics, it may be desirable to block all writes for the entire duration of the snapshot.
- none - prevents the connector from acquiring any table locks during the snapshot process. This value can be used with all snapshot modes, but it is safe to use if and only if no schema changes are happening while the snapshot is taken. Note that tables defined with the MyISAM engine would still be locked despite this property being set, as MyISAM acquires a table lock. This behavior is unlike the InnoDB engine, which acquires row-level locks.
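Putting the MySQL-specific parameters together, a hedged sketch; the database and table names are made up, the object keys are ours, and when_needed and minimal follow the recommendations above:

```javascript
const mysql = {
  includeDatabases: "inventory,sales",            // regexes of database names
  includeTables: "inventory\\.orders,sales\\..*", // regexes of databaseName.tableName
  caseSensitiveMatching: false,                   // default: case-insensitive db.table matching
  snapshotMode: "when_needed",                    // recommended above
  snapshotLockingMode: "minimal"                  // holds the global read lock only briefly
};
```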
Debezium MySQL properties
Read about all properties available for Debezium MySQL connector. The properties can be set as key-value pairs in Other Parameters.
Oracle parameters
Oracle CDC connector parameters
- Host - the host or IP address of the source database.
- Port - the port that the source database is listening to.
- Database - the name of the database to connect to on startup. Must be the CDB name when working with the CDB + PDB model.
- Name of the PDB - name of the PDB to connect to, when working with the CDB + PDB model.
- Include Table(s) - a comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored; any table not included in the whitelist will be excluded from monitoring. Each identifier is of the form databaseName.tableName. By default, the connector will monitor every non-system table in each monitored database. This parameter will be overridden in the TO field of the CDC flow.
- Excluded Column(s) - an optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName.
- Server - logical name that identifies and provides a namespace for the particular Oracle database server being monitored. The logical name should be unique across all other connectors.
- CDC Key - a template for CDC key. The CDC key is used as a file name when streaming CDC events and creating files and as a topic name when streaming CDC events to a message queue such as Kafka. Example of the CDC key: server1_DEBEZIUM_CUSTOMERS_Envelope_cdc_stream. Available tokens:
- [source] - fully qualified event source.
- [op] - the CDC operation: c (create), d (delete), u (update)
- [timestamp] - the timestamp of the event in ms
- [version] - the connector version
- [connector] - the connector class name
- [name] - the connector name
- [ts_ms] - the timestamp of the event in ms
- [snapshot] - true for a snapshot
- [txId] - the transaction ID
- [scn] - the SCN
- CDC Adapter - the adapter implementation to use. xstream uses the Oracle XStreams API. logminer uses the native Oracle LogMiner API. The default is logminer.
- Name of the XStream outbound server - the name of the XStream outbound server configured in the database. This parameter is ignored if CDC Adapter is set to logminer.
- JDBC Fetch Size - the JDBC Fetch Size field can be used to set the maximum number of records which will be retrieved in one database call when fetching the data in a snapshot mode, thus limiting the memory consumption within the JVM. The default is 10000.
- Max Rows to Read - the maximum number of rows to read when fetching the data in a snapshot mode. The default is empty, hence there is no limit.
- Max Field Size - sets the limit for the maximum number of bytes that can be returned for character and binary column values in a ResultSet object produced by this Statement object. The default is empty, hence there is no limit.
- Snapshot Mode - a mode for taking an initial snapshot of the structure and optional data of captured tables. Supported values are:
- initial (will take a snapshot of structure and data of captured tables; useful if topics should be populated with a complete representation of the data from the captured tables)
- initial_schema_only (will take a snapshot of the structure of captured tables only; useful if only changes happening from now onwards should be propagated to topics). Once the snapshot is complete, the connector will continue reading change events from the database’s redo logs.
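A hedged sketch of the Oracle-specific values for the CDB + PDB model; the instance names are illustrative and the object keys are ours:

```javascript
const oracle = {
  database: "ORCLCDB",     // must be the CDB name with the CDB + PDB model
  pdbName: "ORCLPDB1",     // the PDB to connect to
  cdcAdapter: "logminer",  // the default; "xstream" also requires the outbound server name
  snapshotMode: "initial"  // snapshot of both structure and data of the captured tables
};
```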
Debezium Oracle properties
Read about all properties available for Debezium Oracle connector. The properties can be set as key-value pairs in Other Parameters.
Postgres parameters
Postgres CDC connector parameters
- Host - the host or IP address of the source database.
- Port - the port the source database is listening to.
- Database - the name of the database to connect to on startup.
- Include Schema(s) - an optional comma-separated list of regular expressions that match names of schemas for which you want to capture changes. Any schema name not included in schema.include.list is excluded from having its changes captured. By default, all non-system schemas have their changes captured. Do not also set the schema.exclude.list property.
- Include Table(s) - a comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored; any table not included in the whitelist will be excluded from monitoring. Each identifier is of the form databaseName.tableName. By default, the connector will monitor every non-system table in each monitored database. This parameter will be overridden in the TO field of the CDC flow.
- Excluded Column(s) - an optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName.
- Server - logical name that identifies and provides a namespace for the particular PostgreSQL database server/cluster being monitored. The logical name should be unique across all other connectors, since it is used as a prefix for all Kafka topic names coming from this connector. Defaults to host:port/dbname, though we recommend using an explicit and meaningful logical name.
- Output Plugin - PostgreSQL's logical decoding feature, first introduced in version 9.4, is a mechanism that allows the extraction of the changes committed to the transaction log and the processing of these changes in a user-friendly manner with the help of an output plugin. The output plugin must be installed prior to running the PostgreSQL server and enabled together with a replication slot in order for clients to be able to consume the changes.
- Replication Slot Name - The name of the PostgreSQL logical decoding slot that was created for streaming changes from a particular plug-in for a particular database/schema. The server uses this slot to stream events to the connector that you are configuring.
- Append Included Tables to Replication Slot Name - if this option is enabled, the system will append the 'included tables' to the replication slot name. This will create a new replication slot if the list has changed, therefore triggering a new snapshot for all included tables.
- CDC Key - a template for CDC key. The CDC key is used as a file name when streaming CDC events and creating files and as a topic name when streaming CDC events to a message queue such as Kafka. Example of the CDC key: postgres_public_customers_cdc_stream. Available tokens:
- [db] - database name,
- [schema] - schema name,
- [table] - table name.
- [op] - the CDC operation: c (create), d (delete), u (update)
- [timestamp] - the timestamp of the event in ms
- [version] - the connector version
- [connector] - the connector class name
- [name] - the connector name
- [ts_ms] - the timestamp of the event in ms
- [snapshot] - true for a snapshot
- [txId] - the transaction ID
- [lsn] - the LSN
- [xmin] - the xmin
- JDBC Fetch Size - the JDBC Fetch Size field can be used to set the maximum number of records which will be retrieved in one database call when fetching the data in a snapshot mode, thus limiting the memory consumption within the JVM. The default is 10000.
- Max Rows to Read - the maximum number of rows to read when fetching the data in a snapshot mode. The default is empty, hence there is no limit.
- Max Field Size - sets the limit for the maximum number of bytes that can be returned for character and binary column values in a ResultSet object produced by this Statement object. The default is empty, hence there is no limit.
- Snapshot Mode - specifies the criteria for running a snapshot upon startup of the connector.
- The default is initial, and specifies that the connector runs a snapshot only when no offsets have been recorded for the logical server name.
- The always option specifies that the connector runs a snapshot each time on startup.
- The never option specifies that the connector should never use snapshots and that upon first startup with a logical server name the connector should either read from where it last left off (the last LSN position) or start from the beginning from the point of view of the logical replication slot.
- Finally, the initial_only option specifies that the connector should only take an initial snapshot and then stop, without processing any subsequent changes.
It is strongly recommended that you configure a PostgreSQL connector to set snapshot.mode to exported. The initial, initial_only, and always modes can lose a few events while a connector switches from performing the snapshot to streaming change event records when the database is under heavy load.
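A hedged sketch of the PostgreSQL-specific values; the slot name is made up, the object keys are ours, and pgoutput is a common logical decoding plugin offered here as an assumption, not as the product's default:

```javascript
const postgres = {
  outputPlugin: "pgoutput",             // assumption: a common logical decoding plugin
  replicationSlotName: "etlworks_cdc",  // hypothetical slot name
  appendTablesToSlotName: true,         // changing the table list re-snapshots included tables
  snapshotMode: "exported"              // recommended above to avoid losing events under load
};
```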
Debezium PostgreSQL properties
Read about all properties available for Debezium PostgreSQL connector. The properties can be set as key-value pairs in Other Parameters.
DB2 parameters
DB2 CDC connector parameters
- Host - the host or IP address of the source database.
- Port - the port the source database is listening to.
- Database - the name of the database to connect to on startup.
- Include Table(s) - a comma-separated list of regular expressions that match fully-qualified table identifiers for tables to be monitored; any table not included in the whitelist will be excluded from monitoring. Each identifier is of the form databaseName.tableName. By default, the connector will monitor every non-system table in each monitored database. This parameter will be overridden in the TO field of the CDC flow.
- Excluded Column(s) - an optional comma-separated list of regular expressions that match the fully-qualified names of columns that should be excluded from change event message values. Fully-qualified names for columns are of the form databaseName.tableName.columnName, or databaseName.schemaName.tableName.columnName.
- Server - logical name that identifies and provides a namespace for the particular Db2 database server that hosts the database for which Debezium is capturing changes. Only alphanumeric characters and underscores should be used in the database server logical name. The logical name should be unique across all other connectors, since it is used as a topic name prefix for all Kafka topics that receive records from this connector.
- CDC Key - a template for the CDC key. The CDC key is used as a file name when streaming CDC events and creating files and as a topic name when streaming CDC events to a message queue such as Kafka. Example of the CDC key: testDB_testSchema_customers_cdc_stream. Available tokens:
- [db] - database name,
- [schema] - schema name,
- [table] - table name.
- [op] - the CDC operation: c (create), d (delete), u (update)
- [timestamp] - the timestamp of the event in ms
- [version] - the connector version
- [connector] - the connector class name
- [name] - the connector name
- [ts_ms] - the timestamp of the event in ms
- [snapshot] - true for a snapshot
- [change_lsn] - the change LSN
- [commit_lsn] - the commit LSN
- JDBC Fetch Size - the JDBC Fetch Size field can be used to set the maximum number of records which will be retrieved in one database call when fetching the data in a snapshot mode, thus limiting the memory consumption within the JVM. The default is 10000.
- Max Rows to Read - the maximum number of rows to read when fetching the data in a snapshot mode. The default is empty, hence no limit.
- Max Field Size - sets the limit for the maximum number of bytes that can be returned for character and binary column values in a ResultSet object produced by this Statement object. The default is empty, hence there is no limit.
- Snapshot Mode - specifies the criteria for performing a snapshot when the connector starts:
- initial - For tables in capture mode, the connector takes a snapshot of the schema for the table and the data in the table. This is useful for populating Kafka topics with a complete representation of the data.
- schema_only - For tables in capture mode, the connector takes a snapshot of only the schema for the table. This is useful when only the changes that are happening from now on need to be emitted to Kafka topics. After the snapshot is complete, the connector continues by reading change events from the database’s redo logs.
- Snapshot Isolation Mode - during a snapshot, controls the transaction isolation level and how long the connector locks the tables that are in capture mode. The possible values are:
- read_uncommitted - Does not prevent other transactions from updating table rows during an initial snapshot. This mode has no data consistency guarantees; some data might be lost or corrupted.
- read_committed - Does not prevent other transactions from updating table rows during an initial snapshot. It is possible for a new record to appear twice: once in the initial snapshot and once in the streaming phase. However, this consistency level is appropriate for data mirroring.
- repeatable_read - Prevents other transactions from updating table rows during an initial snapshot. It is possible for a new record to appear twice: once in the initial snapshot and once in the streaming phase. However, this consistency level is appropriate for data mirroring.
- exclusive - Uses repeatable read isolation level but takes an exclusive lock for all tables to be read. This mode prevents other transactions from updating table rows during an initial snapshot. Only exclusive mode guarantees full consistency; the initial snapshot and streaming logs constitute a linear history.
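The trade-offs above can be summarized as a small decision sketch; the function and its arguments are ours, purely for illustration:

```javascript
// Map consistency requirements to a DB2 snapshot isolation value,
// following the trade-offs listed above.
function chooseDb2Isolation(needLinearHistory, canBlockWriters) {
  if (needLinearHistory) return "exclusive";     // only exclusive guarantees full consistency
  if (canBlockWriters) return "repeatable_read"; // blocks writers; duplicates still possible
  return "read_committed";                       // writers proceed; fine for data mirroring
}
```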
Debezium DB2 properties
Read about all properties available for Debezium DB2 connector. The properties can be set as key-value pairs in Other Parameters.
MongoDB parameters
MongoDB CDC connector parameters
- Cluster URL - the required connection string. Read about the MongoDB connection string.
- Replica Sets or Shards - how to configure the connector depends on the topology:
- Replica set: provide the addresses of one or more replica set servers as seed addresses. The connector uses these seeds to connect to the replica set and, once connected, gets from the replica set the complete set of members and which member is primary. The connector starts a task to connect to the primary and capture the changes from the primary's oplog. When the replica set elects a new primary, the task automatically switches over to the new primary.
- Sharded cluster: configure the connector with the host addresses of the configuration server replica set. When the connector connects to this replica set, it discovers that it is acting as the configuration server for a sharded cluster, discovers the information about each replica set used as a shard in the cluster, and then starts a separate task to capture the changes from each replica set. If new shards are added to the cluster or existing shards are removed, the connector automatically adjusts its tasks accordingly.
- Include Database(s) - an optional comma-separated list of regular expressions that match database names to be monitored; any database name not included in the whitelist is excluded from monitoring. By default, all databases are monitored.
- Include Collection(s) - an optional comma-separated list of regular expressions that match fully-qualified namespaces (databaseName.collectionName) of MongoDB collections to be monitored; any collection not included in the whitelist is excluded from monitoring.
- Cluster Name - a unique name that identifies the MongoDB replica set or sharded cluster that this connector monitors. Defaults to the URL, though we recommend using an explicit and meaningful logical name.
- CDC Key - a template for CDC key. The CDC key is used as a file name when streaming CDC events and creating files and as a topic name when streaming CDC events to a message queue such as Kafka. Example of the CDC key: inventory_customers_cdc_stream. Available tokens:
- [db] - the database name
- [collection] - the collection name
- [op] - the CDC operation: c (create), d (delete), u (update)
- [timestamp] - the timestamp of the event in ms
- [version] - the connector version
- [connector] - the connector class name
- [name] - the connector name
- [ts_ms] - the timestamp of the event in ms
- [snapshot] - true for a snapshot
- [order] - the order of the event
- [h] - the operation hash (h) of the event
- Reconstruct Record on Update - update events in MongoDB's oplog do not have the before or after states of the changed document; as a result, the CDC event received from the oplog might contain only the updated fields. If you enable this option, the connector will try to reconstruct the entire record by querying MongoDB directly. Note that this parameter is disabled by default.
- Snapshot Mode - specifies the criteria for running a snapshot (e.g., an initial sync) upon startup of the connector.
- The default is initial, and specifies that the connector reads a snapshot when either no offset is found or if the oplog no longer contains the previous offset.
- The never option specifies that the connector should never use snapshots; instead, the connector should proceed to tail the log.
- New name of the Object ID (_id) field - each record in a MongoDB collection has an "_id" field. Typically it is some sort of nested JSON object, often generated automatically. The CDC flow will automatically flatten the value of the "_id" field and transform it into a unique string, e.g., "_id": ObjectId("54759eb3c090d83494e2d804") becomes "_id": "54759eb3c090d83494e2d804" (see the sketch after this list). It will also rename the "_id" field. The default new name is "mongo_id".
- Number of Documents in Explorer - the maximum number of MongoDB documents that can be displayed in the Explorer. The default value is 10; the maximum possible value is 9999. A MongoDB database can contain millions of documents, so displaying all of them could overwhelm the server.
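As referenced above, a sketch of the _id transformation; the name field and its value are made up, and mongo_id is the documented default new name:

```javascript
// Before: the raw document with a nested ObjectId.
const before = { _id: { $oid: "54759eb3c090d83494e2d804" }, name: "Alice" };

// After: the _id value flattened to a string and the field renamed.
const after = { mongo_id: "54759eb3c090d83494e2d804", name: "Alice" };
```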
Debezium MongoDB properties
Read about all properties available for Debezium MongoDB connector. The properties can be set as key-value pairs in Other Parameters.