带有 Avro 和 Schema Repo 的 Apache Kafka - 消息中的模式 ID 在哪里? [英] Apache Kafka with Avro and Schema Repo - where in the message does the schema Id go?

查看:25
本文介绍了带有 Avro 和 Schema Repo 的 Apache Kafka - 消息中的模式 ID 在哪里?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用 Avro 来序列化我的 Kafka 消息的数据,并希望将它与 Avro 架构存储库一起使用,这样我就不必在每条消息中都包含架构.

I want to use Avro to serialize the data for my Kafka messages and would like to use it with an Avro schema repository so I don't have to include the schema with every message.

在 Kafka 中使用 Avro 似乎是一件很受欢迎的事情,很多博客/堆栈溢出问题/用户组等都参考发送带有消息的架构 ID,但我找不到它应该去哪里的实际示例.

Using Avro with Kafka seems like a popular thing to do, and lots of blogs / Stack Overflow questions / usergroups etc reference sending the Schema Id with the message but I cannot find an actual example of where it should go.

我认为它应该放在 Kafka 消息头的某个地方,但我找不到明显的地方.如果它在 Avro 消息中,您将必须根据架构对其进行解码以获取消息内容并显示您需要对其进行解码的架构,这有明显的问题.

I think it should go in the Kafka message header somewhere but I cannot find an obvious place. If it was in the Avro message you would have to decode it against a schema to get the message contents and reveal the schema you need to decode against, which has obvious problems.

我使用的是 C# 客户端,但任何语言的示例都很棒.消息类具有以下字段:

I am using the C# client but an example in any language would be great. The message class has these fields:

public MessageMetadata Meta { get; set; }
public byte MagicNumber { get; set; }
public byte Attribute { get; set; }
public byte[] Key { get; set; }
public byte[] Value { get; set; }

但这些似乎都不正确.MessageMetaData 只有 Offset 和 PartitionId.

but non of these seem correct. The MessageMetaData only has Offset and PartitionId.

那么,Avro Schema Id 应该去哪里?

So, where should the Avro Schema Id go?

推荐答案

schema id 实际上编码在 avro 消息本身中.看看这个 查看编码器/解码器是如何实现的.

The schema id is actually encoded in the avro message itself. Take a look at this to see how encoders/decoders are implemented.

一般来说,当您向 Kafka 发送 Avro 消息时会发生什么:

In general what's happening when you send an Avro message to Kafka:

  1. 编码器从要编码的对象中获取架构.
  2. 编码器向架构注册表询问此架构的 ID.如果架构已注册,您将获得现有 ID,否则 - 注册表将注册架构并返回新 ID.
  3. 对象编码如下:[magic byte][schema id][actual message] 其中,magic byte 只是一个 0x0 字节,用于区分那种消息,schema id 是一个 4 字节的整数值,其余是实际编码的消息.
  1. The encoder gets the schema from the object to be encoded.
  2. Encoder asks the schema registry for an id for this schema. If the schema is already registered you'll get an existing id, if not - the registry will register the schema and return the new id.
  3. The object gets encoded as follows: [magic byte][schema id][actual message] where magic byte is just a 0x0 byte which is used to distinguish that kind of messages, schema id is a 4 byte integer value the rest is the actual encoded message.

当您解码消息时,会发生以下情况:

When you decode the message back here's what happens:

  1. 解码器读取第一个字节并确保它是 0x0.
  2. 解码器读取接下来的 4 个字节并将它们转换为整数值.这就是模式 ID 的解码方式.
  3. 现在,当解码器具有架构 ID 时,它可能会向架构注册表询问此 ID 的实际架构.瞧!

如果您的密钥是 Avro 编码的,那么您的密钥将采用上述格式.这同样适用于价值.这样,您的键和值可能都是 Avro 值并使用不同的架构.

If your key is Avro encoded then your key will be of the format described above. The same applies for value. This way your key and value may be both Avro values and use different schemas.

编辑以回答评论中的问题:

实际模式存储在模式存储库中(实际上是模式存储库的全部点 - 存储模式:)).Avro 对象容器文件格式与上述格式无关.KafkaAvroEncoder/Decoder 使用稍微不同的消息格式(但实际消息的编码方式确实完全相同).

The actual schema is stored in the schema repository (that is the whole point of schema repository actually - to store schemas :)). The Avro Object Container Files format has nothing to do with the format described above. KafkaAvroEncoder/Decoder use slightly different message format (but the actual messages are encoded exactly the same way sure).

这些格式之间的主要区别在于对象容器文件携带实际模式,并且可能包含与该模式对应的多条消息,而上述格式仅携带模式 ID 和与该模式对应的仅一条消息.

The main difference between these formats is that Object Container Files carry the actual schema and may contain multiple messages corresponding to that schema, whereas the format described above carries only the schema id and exactly one message corresponding to that schema.

传递对象容器文件编码的消息可能不太容易跟踪/维护,因为一条 Kafka 消息将包含多条 Avro 消息.或者,您可以确保一条 Kafka 消息只包含一条 Avro 消息,但这会导致每条消息都带有架构.

Passing object-container-file-encoded messages around would probably be not obvious to follow/maintain because one Kafka message would then contain multiple Avro messages. Or you could ensure that one Kafka message contains only one Avro message but that would result in carrying schema with each message.

Avro 模式可能非常大(我见过像 600 KB 甚至更多的模式),并且将模式与每条消息一起携带会非常昂贵和浪费,因此这就是模式存储库的作用所在 - 模式仅被获取一次并且在本地缓存,所有其他查找都只是快速的地图查找.

Avro schemas can be quite large (I've seen schemas like 600 KB and more) and carrying the schema with each message would be really costly and wasteful so that is where schema repository kicks in - the schema is fetched only once and gets cached locally and all other lookups are just map lookups that are fast.

这篇关于带有 Avro 和 Schema Repo 的 Apache Kafka - 消息中的模式 ID 在哪里?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆