Kafka Connect JDBC vs Debezium CDC


Question

What are the differences between the JDBC Connector and the Debezium SQL Server CDC Connector (or any other relational database connector), and when should I choose one over the other when looking for a solution to sync between two relational databases?

I'm not sure if this discussion should be about CDC vs the JDBC Connector rather than the Debezium SQL Server CDC Connector specifically, or even just Debezium in general; I'm open to editing it later, depending on the answers given (though my case is about a SQL Server sink).

Sharing my research on the topic, which led me to ask this question (posted as the answer below).

Recommended Answer

This explanation focuses on the differences between the Debezium SQL Server CDC Connector and the JDBC Connector, with a more general interpretation of Debezium and CDC.

Debezium is used only as a source connector and records all row-level changes.
The Debezium documentation says:

Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.

The Debezium connector for SQL Server first records a snapshot of the database and then streams records of row-level changes to Kafka, with each table going to a different Kafka topic.
The Debezium connector for SQL Server documentation says:

Debezium's SQL Server Connector can monitor and record the row-level changes in the schemas of a SQL Server database.

The first time it connects to a SQL Server database/cluster, it reads a consistent snapshot of all of the schemas. When that snapshot is complete, the connector continuously streams the changes that were committed to SQL Server and generates corresponding insert, update and delete events. All of the events for each table are recorded in a separate Kafka topic, where they can be easily consumed by applications and services.
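As a rough illustration of how such a source is set up, below is a minimal sketch of registering a Debezium SQL Server connector through the Kafka Connect REST API from Python. All host names, credentials, and database/table names are placeholder assumptions, and some property names vary between Debezium releases (newer versions, for example, renamed database.server.name to topic.prefix).

    # Minimal sketch: register a Debezium SQL Server source connector via the
    # Kafka Connect REST API. All connection details below are placeholders,
    # and property names differ slightly between Debezium releases.
    import requests

    debezium_source = {
        "name": "sqlserver-source",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
            "database.hostname": "sqlserver.example.com",   # assumed host
            "database.port": "1433",
            "database.user": "debezium",                     # assumed credentials
            "database.password": "secret",
            "database.dbname": "inventory",                  # database to snapshot and stream
            "database.server.name": "server1",               # logical name used as topic prefix
            "table.include.list": "dbo.customers",           # tables to capture
            "database.history.kafka.bootstrap.servers": "kafka:9092",
            "database.history.kafka.topic": "schema-changes.inventory",
        },
    }

    # Kafka Connect workers expose a REST API (port 8083 by default) for creating connectors.
    resp = requests.post("http://connect:8083/connectors", json=debezium_source)
    resp.raise_for_status()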



Kafka Connect JDBC can be used either as a source connector into Kafka or as a sink connector out of Kafka, and it supports any database with a JDBC driver.
The JDBC Connector documentation says:

You can use the Kafka Connect JDBC source connector to import data from any relational database with a JDBC driver into Apache Kafka® topics. You can use the JDBC sink connector to export data from Kafka topics to any relational database with a JDBC driver. The JDBC connector supports a wide variety of databases without requiring custom code for each one.

They have some specifications about installing on Microsoft SQL Server, which I find not relevant for this discussion.

So if the JDBC Connector supports both source and sink, and Debezium supports only source (not sink), we understand that in order to write data from Kafka into a database over a JDBC driver (sink), the JDBC Connector is the way to go (including for SQL Server).
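For the sink side, here is a hedged sketch of what such a JDBC sink registration could look like for SQL Server; the connection URL, credentials, topic, and key settings are assumptions and need to match your actual schema and record keys.

    # Minimal sketch: register a Confluent JDBC sink connector that writes Kafka
    # topics into SQL Server. All names, credentials and key settings are placeholders.
    import requests

    jdbc_sink = {
        "name": "sqlserver-sink",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
            "connection.url": "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=target_db",
            "connection.user": "connect_user",
            "connection.password": "secret",
            "topics": "customers",        # topic(s) to sink; the table name is derived from the topic
            "insert.mode": "upsert",      # turn each record into an insert/upsert
            "pk.mode": "record_key",      # take the primary key from the Kafka record key
            "pk.fields": "id",            # assumed key column
            "auto.create": "true",        # let the connector create the target table if missing
        },
    }

    resp = requests.post("http://connect:8083/connectors", json=jdbc_sink)
    resp.raise_for_status()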

Now the comparison should be narrowed down to the source side only.
The JDBC Source Connector documentation doesn't say much more at first sight:

Data is loaded by periodically executing a SQL query and creating an output record for each row in the result set. By default, all tables in a database are copied, each to its own output topic. The database is monitored for new or deleted tables and adapts automatically. When copying data from a table, the connector can load only new or modified rows by specifying which columns should be used to detect new or modified data.
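A hedged sketch of that query-based behaviour in configuration form: a JDBC source connector that polls every poll.interval.ms and detects new or modified rows from an incrementing id plus a timestamp column. The table and column names are assumptions.

    # Minimal sketch: a JDBC source connector that periodically queries the source
    # database and uses an id + timestamp column (assumed names) to find changed rows.
    import requests

    jdbc_source = {
        "name": "jdbc-source",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=source_db",
            "connection.user": "connect_user",
            "connection.password": "secret",
            "table.whitelist": "customers",            # assumed table to copy
            "mode": "timestamp+incrementing",          # detect changes via both columns
            "incrementing.column.name": "id",          # assumed auto-incrementing primary key
            "timestamp.column.name": "last_modified",  # assumed last-modified column
            "poll.interval.ms": "5000",                # run the query every 5 seconds
            "topic.prefix": "jdbc-",                   # topics become jdbc-<table-name>
        },
    }

    resp = requests.post("http://connect:8083/connectors", json=jdbc_source)
    resp.raise_for_status()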


Searching a little further in order to understand their differences, in this Debezium blog, which uses the Debezium MySQL Connector as a source and the JDBC Connector as a sink, there is an explanation of the differences between the two. Generally speaking, Debezium provides records with more information about the database changes, while the JDBC Connector provides records that are more focused on converting the database changes into simple insert/upsert commands:

The Debezium MySQL Connector was designed to specifically capture database changes and provide as much information as possible about those events beyond just the new state of each row. Meanwhile, the Confluent JDBC Sink Connector was designed to simply convert each message into a database insert/upsert based upon the structure of the message. So, the two connectors have different structures for the messages, but they also use different topic naming conventions and behavior of representing deleted records.
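To illustrate that structural difference, here is a rough, made-up sketch of the Debezium change-event envelope for an update next to the flat record a JDBC source would produce for the same row (all field values are placeholders):

    # Illustration only: shape of a Debezium update event envelope versus the flat
    # record a JDBC source emits. All field values below are made-up placeholders.
    debezium_update_event = {
        "before": {"id": 1, "email": "old@example.com"},      # row state before the change
        "after":  {"id": 1, "email": "new@example.com"},      # row state after the change
        "source": {"db": "inventory", "table": "customers"},  # metadata about where the change came from
        "op": "u",                                            # c = create, u = update, d = delete
        "ts_ms": 1559033904863,                               # when the change was processed
    }

    jdbc_source_record = {
        "id": 1,
        "email": "new@example.com",  # only the current row state, no before-image or operation type
    }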

Moreover, they have different topic naming and different delete methods:

Debezium uses fully qualified naming for target topics representing each table it manages. The naming follows the pattern [logical-name].[database-name].[table-name]. Kafka Connect JDBC Connector works with simple names [table-name].
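This naming mismatch matters when chaining a Debezium source to a JDBC sink, because the sink derives the target table name from the topic. One common workaround (a sketch, not the only option) is Kafka Connect's built-in RegexRouter transform on the sink, which rewrites the fully qualified topic name down to the plain table name:

    # Sketch: transform settings to merge into a JDBC sink connector's "config" map.
    # RegexRouter rewrites "[logical-name].[database-name].[table-name]" topics to
    # just "[table-name]"; the regex assumes a three-part topic name.
    route_topic_to_table = {
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
        "transforms.route.replacement": "$3",  # keep only the table-name part
    }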

...

When the Debezium connector detects a row is deleted, it creates two event messages: a delete event and a tombstone message. The delete message has an envelope with the state of the deleted row in the before field, and an after field that is null. The tombstone message contains the same key as the delete message, but the entire message value is null, and Kafka's log compaction utilizes this to know that it can remove any earlier messages with the same key. A number of sink connectors, including Confluent's JDBC Sink Connector, are not expecting these messages and will instead fail if they see either kind of message.
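If you do want to feed Debezium change events into a JDBC sink despite this, a common mitigation (sketched below as an assumption, not a requirement) is Debezium's ExtractNewRecordState transform on the source, which flattens the envelope and can suppress the tombstone and delete messages the sink cannot handle; note that with these settings deletes simply never reach the sink.

    # Sketch: settings to merge into a Debezium source connector's "config" map.
    # ExtractNewRecordState flattens the change-event envelope and drops the
    # messages a plain JDBC sink would fail on. Option names vary by Debezium version.
    flatten_for_jdbc_sink = {
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.drop.tombstones": "true",       # do not emit tombstone records
        "transforms.unwrap.delete.handling.mode": "drop",  # drop delete events entirely
    }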

This Confluent blog explains in more detail how CDC and the JDBC Connector work: the JDBC Connector executes queries against the source database at a fixed interval, which is not a very scalable solution, while CDC operates at a higher frequency, streaming changes from the database transaction log:

The connector works by executing a query, over JDBC, against the source database. It does this to pull in all rows (bulk) or those that changed since previously (incremental). This query is executed at the interval defined in poll.interval.ms. Depending on the volumes of data involved, the physical database design (indexing, etc.), and other workload on the database, this may not prove to be the most scalable option.

...

Done properly, CDC basically enables you to stream every single event from a database into Kafka. Broadly put, relational databases use a transaction log (also called a binlog or redo log depending on DB flavour), to which every event in the database is written. Update a row, insert a row, delete a row – it all goes to the database's transaction log. CDC tools generally work by utilising this transaction log to extract at very low latency and low impact the events that are occurring on the database (or a schema/table within it).

This blog also covers the differences between CDC and the JDBC Connector; mainly, it says that the JDBC Connector doesn't support syncing deleted records and therefore fits prototyping, while CDC fits more mature systems:

The JDBC Connector cannot fetch deleted rows. Because, how do you query for data that doesn't exist?

...

My general steer on CDC vs JDBC is that JDBC is great for prototyping, and fine for low-volume workloads. Things to consider if using the JDBC connector:

- Doesn't give true CDC (capture delete records, want before/after record versions)
- Latency in detecting new events
- Impact of polling the source database continually (and balancing this with the desired latency)
- Unless you're doing a bulk pull from a table, you need to have an ID and/or timestamp that you can use to spot new records. If you don't own the schema, this becomes a problem.


tl;dr Conclusion

The main differences between Debezium and the JDBC Connector are:

  1. Debezium is used only as a Kafka source, while the JDBC Connector can be used as both a Kafka source and sink.

For sources:

  1. The JDBC Connector doesn't support syncing deleted records, while Debezium does.
  2. The JDBC Connector queries the database at a fixed interval, which is not a very scalable solution, while CDC operates at a higher frequency, streaming changes from the database transaction log.
  3. Debezium provides records with more information about the database changes, while the JDBC Connector provides records that are more focused on converting the database changes into simple insert/upsert commands.
  4. Different topic naming conventions.

