Kafka Streams Internal Data Management


Question

In my company, we are using Kafka extensively, but we have been using a relational database to store the results of several intermediary transformations and aggregations, for fault-tolerance reasons. Now we are exploring Kafka Streams as a more natural way to do this. Often, our needs are quite simple - one such case is:

  • Listen to an input queue of <K1,V1>, <K2,V2>, <K1,V2>, <K1,V3>...
  • For each record, perform some high latency operation (call a remote service)
  • If, by the time <K1,V1> is processed, both <K1,V2> and <K1,V3> have already been produced, then I should process only V3, as V2 has already become stale

In order to achieve this, I am reading the topic as a KTable. The code looks like the snippet below:

KStreamBuilder builder = new KStreamBuilder();
// Reading the topic as a table keeps only the latest value per key
KTable<String, String> kTable = builder.table("input-topic");
// Forward every update to the remote (high-latency) service
kTable.toStream().foreach((K, V) -> client.post(V));
return builder;
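
For context, a minimal sketch of how such a builder is typically wired into a running application with the 0.11.0 API; the application id, broker address, and serdes here are illustrative assumptions, not from the original question:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-poster");       // hypothetical app id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker
props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();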

This works as expected, but it is not clear to me how Kafka achieves this automagically. I was assuming that Kafka creates internal topics for it, but I do not see any internal topics being created. The Javadoc for the method seems to confirm this observation. But then I came across this official page, which seems to suggest that Kafka uses a separate datastore, namely RocksDB, along with a changelog topic.

Now I am confused as to the circumstances under which a changelog topic is created. My questions are:

  1. If the official page says that the default behavior of state stores is fault-tolerant, where is that state stored? In RocksDB, in the changelog topic, or in both?
  2. What are the implications of relying on RocksDB in production? (edited)
     1. As I understand it, the dependency on RocksDB is transparent (just a jar file), and RocksDB stores its data on the local file system. But this also means that, in our case, the application would keep a copy of the sharded data on the storage of the machine it runs on. When we replace a remote database with a KTable, this has storage implications, and that is my point.
     2. Will Kafka releases ensure that RocksDB keeps working on the various platforms? (since it seems to be platform-dependent and not written in Java)

  • Does it make sense to log-compact the input topic?

    I am using v. 0.11.0

    Answer

    1. Kafka Streams stores state locally, by default in RocksDB. However, local state is ephemeral. For fault tolerance, all updates to a store are also written into a changelog topic. This allows the store to be rebuilt and/or migrated in case of failure or of scaling in/out. For your particular case, no changelog topic is created because the KTable is not the result of an aggregation but is populated directly from a topic -- this is just an optimization. Since the changelog topic would contain exactly the same data as the input topic, Kafka Streams avoids the data duplication and uses the input topic as the changelog topic in case of an error scenario.
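
    To make the contrast concrete, here is a sketch (store and topic names are illustrative) of a KTable that is the result of an aggregation; for such a store Kafka Streams does create a changelog topic, named <application.id>-<store name>-changelog:

    KStream<String, String> stream = builder.stream("input-topic");
    // This count store is NOT populated directly from a topic, so its updates
    // are backed by a changelog topic, e.g. "ktable-poster-counts-store-changelog"
    KTable<String, Long> counts = stream.groupByKey().count("counts-store");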

    2. Not sure exactly what you mean by this question. Note that RocksDB is considered an ephemeral store. It is used by default for various reasons, as discussed in "Why Apache Kafka Streams uses RocksDB and if how is it possible to change it?" (for example, it allows holding state larger than main memory, as it can spill to disk). You can replace RocksDB with any other store; Kafka Streams also ships with an in-memory store. (Edit)
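
    As a hedged sketch (store name and serde choices are illustrative) of what swapping in the bundled in-memory store can look like with the 0.11-era Stores API, applied to an aggregation:

    // Supplier for an in-memory (non-RocksDB) key-value store
    StateStoreSupplier<KeyValueStore> inMemoryStore = Stores.create("counts-store")
            .withStringKeys()
            .withLongValues()
            .inMemory()
            .build();

    // Back the aggregation state with the in-memory store instead of RocksDB
    KTable<String, Long> counts = stream.groupByKey().count(inMemoryStore);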

    1. Yes. You need to provision your application instances accordingly so that they can hold the local shards of the overall state. There is a sizing guide for this: https://docs.confluent.io/current/stream/sizing.html
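
    For illustration, the local state location is controlled by the state.dir setting, so provisioning usually means sizing the volume behind it (the path below is a placeholder):

    Properties props = new Properties();
    // Directory where RocksDB keeps this instance's shard of the state;
    // size the underlying volume according to the sizing guide above
    props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");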

    2. RocksDB is written in C++ and included via a JNI binding. There are some known issues on Windows, as RocksDB does not provide pre-compiled binaries for all versions of Windows. As long as you stay on a Linux-based platform, it should work. The Kafka community runs upgrade tests for RocksDB to make sure it is compatible.

  • Yes. Kafka Streams actually assumes that the input topic for a table() operation is log-compacted. Otherwise, there is a risk of data loss in case of failure. (Edit)
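
    For illustration, a compacted topic can be created with the AdminClient API that shipped in 0.11.0; the broker address, partition count, and replication factor below are placeholder assumptions:

    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
    try (AdminClient admin = AdminClient.create(props)) {
        NewTopic inputTopic = new NewTopic("input-topic", 3, (short) 2)      // illustrative sizing
                .configs(Collections.singletonMap("cleanup.policy", "compact"));
        // Blocks until the topic is created (throws checked exceptions on failure)
        admin.createTopics(Collections.singleton(inputTopic)).all().get();
    }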

    1. If you enable log compaction, the retention time setting is ignored. So yes, the latest update per key will be maintained forever (or until a tombstone message with value=null is written). Note that when compaction is executed on the broker side, old data is garbage collected, so on restore only the newest version per key is read -- older versions were removed during the compaction process. If you are no longer interested in some data after a certain period of time, you would need to write a tombstone into the source topic to make that work.
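
    A minimal sketch of writing such a tombstone with the plain producer API (broker address and key are illustrative):

    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
        // A null value is a tombstone: compaction will eventually drop key "K1" entirely
        producer.send(new ProducerRecord<>("input-topic", "K1", null));
    }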
