Can HdfsSink3Connector create duplicates?


Question

As per the documentation, the sink connector ensures exactly-once delivery.

How does it ensure exactly-once delivery in case of a connector task thread failure?

Does it remove the file created by the failed task thread? Or does it leave a corrupted/partial file in HDFS?

The connector uses a write-ahead log to ensure each record is written to HDFS exactly once. The connector also manages offsets by encoding the Kafka offset information into the HDFS filenames, so that it can start from the last committed offsets in case of failures and task restarts.
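For context, committed data files are named after the topic, partition, and the start/end offsets of the records they contain, e.g. logs+0+0000000000+0000000099.avro. Below is a minimal sketch of parsing such a name; the pattern and helper class are illustrative assumptions, while the connector's actual parsing lives in FileUtils.extractOffset, shown in the answer below.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommittedFilenames {

  // Assumed filename scheme: <topic>+<partition>+<startOffset>+<endOffset>.<extension>,
  // with zero-padded offsets, e.g. "logs+0+0000000000+0000000099.avro".
  private static final Pattern COMMITTED_FILENAME =
      Pattern.compile("(.+)\\+(\\d+)\\+(\\d+)\\+(\\d+)\\.(.+)");

  // Returns the offset of the last record in the file, or -1 if the name does not match.
  static long extractEndOffset(String filename) {
    Matcher m = COMMITTED_FILENAME.matcher(filename);
    return m.matches() ? Long.parseLong(m.group(4)) : -1L;
  }

  public static void main(String[] args) {
    System.out.println(extractEndOffset("logs+0+0000000000+0000000099.avro")); // 99
  }
}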

Please help me understand this.

Answer

The HDFS connector saves the offsets in the filenames and feeds them back to the consumer API in the connector, so that it knows where it needs to continue. This is how it provides Exactly-Once Semantics (EOS) and avoids duplicates.

/**
 * The HDFS connector tracks offsets in the filenames in HDFS (used for Exactly Once
 * semantics) as the offset of the last record written to the last file in HDFS.
 * This method returns the next offset after the last one in HDFS, which is useful
 * for some APIs (like Kafka consumer offset tracking).
 *
 * @return the next offset after the last one written to HDFS, or -1 if no file has
 *         been committed yet
 */

https://github.com/confluentinc/kafka-connect-hdfs/blob/1d68023c38e17f0ed6f87f3b78d86c2e08f39909/src/main/java/io/confluent/connect/hdfs/TopicPartitionWriter.java

The offset is read from the filename:

long lastCommittedOffsetToHdfs = FileUtils.extractOffset(
    fileStatusWithMaxOffset.getPath().getName());
log.trace("Last committed offset based on filenames: {}", lastCommittedOffsetToHdfs);
// `offset` represents the next offset to read after the most recent commit
offset = lastCommittedOffsetToHdfs + 1;
log.trace("Next offset to read: {}", offset);

If the HDFS file has been written to disk, then on task start the offset will be read from the filename and the task will continue from that point.

If the file has not been written to disk yet, the task will start reading again from the earlier position and try to write the file to HDFS; on success it will commit the offsets. If the commit fails but the file exists on HDFS, the connector will take the offset to continue from the HDFS filename on the next task start.
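To make that restart decision concrete, here is a minimal sketch under the assumptions above (the method and the fallback handling are illustrative, not the connector's actual code):

public class RestartRecovery {

  // Decides where a restarted task should resume, given the highest end offset
  // found in committed HDFS filenames (-1 if no file was committed).
  static long resumeOffset(long maxCommittedEndOffsetInHdfs, long fallbackOffset) {
    if (maxCommittedEndOffsetInHdfs >= 0) {
      // A committed file already covers these records (e.g. the write succeeded but
      // the task died before acknowledging): skip re-writing, resume right after it.
      return maxCommittedEndOffsetInHdfs + 1;
    }
    // No committed file yet: re-consume from the fallback position and write again.
    // Re-processing the same records produces the same target filename, so the
    // retry is idempotent rather than producing a duplicate.
    return fallbackOffset;
  }

  public static void main(String[] args) {
    System.out.println(resumeOffset(99L, 50L));  // 100: file committed, continue after it
    System.out.println(resumeOffset(-1L, 50L));  // 50: nothing committed, retry the write
  }
}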
