Kafka Connect JDBC vs Debezium CDC


Problem Description

What are the differences between the JDBC Connector and the Debezium SQL Server CDC Connector (or any other relational database connector), and when should I choose one over the other, when searching for a solution to sync between two relational databases?

I'm not sure whether this discussion should be about CDC vs the JDBC Connector rather than the Debezium SQL Server CDC Connector specifically, or even just Debezium; I may edit it later depending on the answers given (though my case is about a SQL Server sink).

Sharing with you my research on this topic, which led me to the question (posted as an answer).

Solution

This explanation focuses on the differences between the Debezium SQL Server CDC Connector and the JDBC Connector, with a more general interpretation of Debezium and CDC.

tl;dr- scroll down :)


Debezium

Debezium is used only as a source connector; it records all row-level changes.
Debezium Documentation says:

Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

Debezium Connector for SQL Server first records a snapshot of the database and then sends records of row-level changes to Kafka, each table to a different Kafka topic.
Debezium Connector for SQL Server Documentation says:

Debezium’s SQL Server Connector can monitor and record the row-level changes in the schemas of a SQL Server database.

The first time it connects to a SQL Server database/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to SQL Server and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.
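
To make this concrete, a minimal Debezium SQL Server source connector configuration might look roughly like the sketch below. The host, credentials, database and table names are placeholders, and property names vary between Debezium versions (this follows the older 1.x naming), so treat it as an illustration rather than a tested configuration:

# Hypothetical Debezium SQL Server source connector (Debezium 1.x property names).
# Host, credentials, database and table names are placeholders.
name=inventory-sqlserver-source
connector.class=io.debezium.connector.sqlserver.SqlServerConnector
database.hostname=sqlserver.example.com
database.port=1433
database.user=debezium
database.password=secret
database.dbname=inventory
# Logical server name; Debezium uses it as the prefix of every topic it writes to.
database.server.name=server1
# Tables to capture (older versions call this table.whitelist).
table.include.list=dbo.customers
# Debezium keeps the captured schema history in its own Kafka topic.
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.inventory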


Kafka Connect JDBC

Kafka Connect JDBC can be used either as a source or a sink connector to Kafka, and supports any database with a JDBC driver.
JDBC Connector Documentation says:

You can use the Kafka Connect JDBC source connector to import data from any relational database with a JDBC driver into Apache Kafka® topics. You can use the JDBC sink connector to export data from Kafka topics to any relational database with a JDBC driver. The JDBC connector supports a wide variety of databases without requiring custom code for each one.

They have some specifications about installing on Microsoft SQL Server, which I find not relevant to this discussion.

So if JDBC Connector supports both source and sink and Debezium supports only source (not sink), we understand that in order to write data from Kafka to databases with a JDBC driver (sink), the JDBC Connector is the way to go (including SQL Server).
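
As an illustration of that sink path, a minimal JDBC sink connector configuration writing a Kafka topic into SQL Server could look roughly like the following sketch (the connection URL, credentials, topic and key column are placeholders, not a tested configuration):

# Hypothetical JDBC sink connector writing a Kafka topic into SQL Server.
# Connection URL, credentials, topic and key column are placeholders.
name=sqlserver-jdbc-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
connection.url=jdbc:sqlserver://sqlserver.example.com:1433;databaseName=inventory
connection.user=connect
connection.password=secret
topics=customers
# Upsert rows using the primary key taken from the Kafka record key.
insert.mode=upsert
pk.mode=record_key
pk.fields=id
# Let the connector create the target table if it does not exist.
auto.create=true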

Now the comparison should be narrowed down to sources only.
JDBC Source Connector Documentation doesn't say much more at first sight:

Data is loaded by periodically executing a SQL query and creating an output record for each row in the result set. By default, all tables in a database are copied, each to its own output topic. The database is monitored for new or deleted tables and adapts automatically. When copying data from a table, the connector can load only new or modified rows by specifying which columns should be used to detect new or modified data.
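
In practice that polling behaviour is driven by a handful of properties. A minimal JDBC source connector configuration in timestamp+incrementing mode might look roughly like this sketch (connection URL, table and column names are placeholders; the right mode and columns depend on your schema):

# Hypothetical JDBC source connector polling SQL Server for new or updated rows.
# Connection URL, table and column names are placeholders.
name=sqlserver-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver://sqlserver.example.com:1433;databaseName=inventory
connection.user=connect
connection.password=secret
table.whitelist=customers
# Detect changes with a monotonically increasing id plus a last-updated timestamp column.
mode=timestamp+incrementing
incrementing.column.name=id
timestamp.column.name=updated_at
# Each table is written to a topic named <topic.prefix><table-name>.
topic.prefix=jdbc-
# How often the connector re-runs its query against the source database.
poll.interval.ms=5000

Note that a configuration like this only picks up inserts and updates that are visible through those columns; deleted rows never appear in the query result, which is exactly the limitation discussed further below.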


Searching a little further in order to understand their differences: this Debezium blog, which uses the Debezium MySQL Connector as a source and the JDBC Connector as a sink, explains the differences between the two. It generally tells us that Debezium provides records with more information about the database changes, while the JDBC Connector provides records that are more focused on converting the database changes into simple insert/upsert commands:

The Debezium MySQL Connector was designed to specifically capture database changes and provide as much information as possible about those events beyond just the new state of each row. Meanwhile, the Confluent JDBC Sink Connector was designed to simply convert each message into a database insert/upsert based upon the structure of the message. So, the two connectors have different structures for the messages, but they also use different topic naming conventions and behavior of representing deleted records.

Moreover, they have different topic naming conventions and different ways of handling deletes:

Debezium uses fully qualified naming for target topics representing each table it manages. The naming follows the pattern [logical-name].[database-name].[table-name]. Kafka Connect JDBC Connector works with simple names [table-name].
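
For example, with the hypothetical configurations sketched above, Debezium would write changes for dbo.customers to a fully qualified topic such as server1.dbo.customers (for the SQL Server connector the middle segment is the schema name; the exact pattern varies by connector and version), while the JDBC source connector would use the plain table name, customers, or jdbc-customers with the topic.prefix shown earlier.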

...

When the Debezium connector detects a row is deleted, it creates two event messages: a delete event and a tombstone message. The delete message has an envelope with the state of the deleted row in the before field, and an after field that is null. The tombstone message contains same key as the delete message, but the entire message value is null, and Kafka’s log compaction utilizes this to know that it can remove any earlier messages with the same key. A number of sink connectors, including the Confluent’s JDBC Sink Connector, are not expecting these messages and will instead fail if they see either kind of message.
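
In other words, for a single deleted row Debezium publishes two records to the table's topic. Heavily simplified (real change events also carry a source block, schema information and timestamps, and the field names below are hypothetical), they look roughly like this:

Key, identical for both records:
  {"id": 42}

Value of the delete event, with the old row in before, after set to null, and op marking a delete:
  {"before": {"id": 42, "name": "Alice"}, "after": null, "op": "d"}

Value of the tombstone record, which is simply null, so that Kafka log compaction can later drop every older record with the same key:
  null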

This Confluent blog explains in more detail how the CDC and JDBC connectors work. The JDBC Connector executes queries against the source database at a fixed interval, which is not a very scalable solution, while CDC has a higher frequency, streaming from the database transaction log:

The connector works by executing a query, over JDBC, against the source database. It does this to pull in all rows (bulk) or those that changed since previously (incremental). This query is executed at the interval defined in poll.interval.ms. Depending on the volumes of data involved, the physical database design (indexing, etc.), and other workload on the database, this may not prove to be the most scalable option.

...

Done properly, CDC basically enables you to stream every single event from a database into Kafka. Broadly put, relational databases use a transaction log (also called a binlog or redo log depending on DB flavour), to which every event in the database is written. Update a row, insert a row, delete a row – it all goes to the database’s transaction log. CDC tools generally work by utilising this transaction log to extract at very low latency and low impact the events that are occurring on the database (or a schema/table within it).

This blog also states the differences between CDC and the JDBC Connector; it mainly says that the JDBC Connector doesn't support syncing deleted records and thus fits prototyping, while CDC fits more mature systems:

The JDBC Connector cannot fetch deleted rows. Because, how do you query for data that doesn’t exist?

...

My general steer on CDC vs JDBC is that JDBC is great for prototyping, and fine for low-volume workloads. Things to consider if using the JDBC connector:

  - Doesn’t give true CDC (capture delete records, want before/after record versions)
  - Latency in detecting new events
  - Impact of polling the source database continually (and balancing this with the desired latency)
  - Unless you’re doing a bulk pull from a table, you need to have an ID and/or timestamp that you can use to spot new records. If you don’t own the schema, this becomes a problem.


tl;dr Conclusion

The main differences between Debezium and JDBC Connector are:

  1. Debezium is used only as a Kafka source, while JDBC Connector can be used as both a Kafka source and sink.

For sources:

  1. JDBC Connector doesn't support syncing deleted records, while Debezium does.
  2. JDBC Connector queries the database every fixed interval, which is not a very scalable solution, while CDC has a higher frequency, streaming from the database transaction log.
  3. Debezium provides records with more information about the database changes, while JDBC Connector provides records that are more focused on converting the database changes into simple insert/upsert commands.
  4. They use different topic naming conventions.
