带有Avro和架构存储库的Apache Kafka-架构ID在消息中的何处? [英] Apache Kafka with Avro and Schema Repo - where in the message does the schema Id go?

查看:129
本文介绍了带有Avro和架构存储库的Apache Kafka-架构ID在消息中的何处?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用Avro序列化我的Kafka消息的数据,并希望将其与Avro架构存储库一起使用,因此我不必在每条消息中都包含该架构.

I want to use Avro to serialize the data for my Kafka messages and would like to use it with an Avro schema repository so I don't have to include the schema with every message.

将Avro与Kafka一起使用似乎是一件很受欢迎的事情,很多博客/Stack Overflow问题/用户组等都引用了将Schema Id与消息一起发送的消息,但我找不到它应该去哪里的实际示例.

Using Avro with Kafka seems like a popular thing to do, and lots of blogs / Stack Overflow questions / usergroups etc reference sending the Schema Id with the message but I cannot find an actual example of where it should go.

我认为它应该放在Kafka消息标题中的某个位置,但是我找不到明显的位置.如果它在Avro消息中,则必须根据模式对其进行解码,以获取消息内容并显示您需要针对其进行解码的模式,这显然存在问题.

I think it should go in the Kafka message header somewhere but I cannot find an obvious place. If it was in the Avro message you would have to decode it against a schema to get the message contents and reveal the schema you need to decode against, which has obvious problems.

我正在使用C#客户端,但是任何语言的示例都很好.消息类具有以下字段:

I am using the C# client but an example in any language would be great. The message class has these fields:

public MessageMetadata Meta { get; set; }
public byte MagicNumber { get; set; }
public byte Attribute { get; set; }
public byte[] Key { get; set; }
public byte[] Value { get; set; }

但这些都不是正确的. MessageMetaData仅具有Offset和PartitionId.

but non of these seem correct. The MessageMetaData only has Offset and PartitionId.

那么,Avro模式ID应该放在哪里?

So, where should the Avro Schema Id go?

推荐答案

模式ID实际上是在avro消息本身中编码的.看看了解编码器/解码器的实现方式.

The schema id is actually encoded in the avro message itself. Take a look at this to see how encoders/decoders are implemented.

通常,当您向Kafka发送Avro消息时会发生什么情况

In general what's happening when you send an Avro message to Kafka:

  1. 编码器从要编码的对象获取模式.
  2. 编码器向架构注册表询问此架构的ID.如果该模式已经注册,您将获得一个现有的ID;否则,注册中心将注册该模式并返回新ID.
  3. 对象的编码方式如下:[magic byte] [schema id] [actual message]其中,magic byte只是一个0x0字节,用于区分这种消息,schema id是一个4字节的整数值剩下的就是实际的编码消息.
  1. The encoder gets the schema from the object to be encoded.
  2. Encoder asks the schema registry for an id for this schema. If the schema is already registered you'll get an existing id, if not - the registry will register the schema and return the new id.
  3. The object gets encoded as follows: [magic byte][schema id][actual message] where magic byte is just a 0x0 byte which is used to distinguish that kind of messages, schema id is a 4 byte integer value the rest is the actual encoded message.

当您解码回消息时,会发生以下情况:

When you decode the message back here's what happens:

  1. 解码器读取第一个字节并确保它是0x0.
  2. 解码器读取接下来的4个字节并将其转换为整数值.这就是对模式ID进行解码的方式.
  3. 现在,当解码器具有模式ID时,它可以向模式注册表查询该ID的实际模式.瞧!

如果您的密钥是Avro编码的,则您的密钥将具有上述格式.价值同样如此.这样,您的键和值既可以是Avro值,也可以使用不同的架构.

If your key is Avro encoded then your key will be of the format described above. The same applies for value. This way your key and value may be both Avro values and use different schemas.

编辑以回答评论中的问题:

Edit to answer the question in comment:

实际的模式存储在模式存储库中(实际上是模式存储库的整个要点-存储模式:). Avro对象容器文件格式与上述格式无关. KafkaAvroEncoder/Decoder使用略有不同的消息格式(但实际消息的编码方式完全相同).

The actual schema is stored in the schema repository (that is the whole point of schema repository actually - to store schemas :)). The Avro Object Container Files format has nothing to do with the format described above. KafkaAvroEncoder/Decoder use slightly different message format (but the actual messages are encoded exactly the same way sure).

这些格式之间的主要区别在于,对象容器文件携带实际的模式,并且可能包含与该模式相对应的多条消息,而上述格式仅包含模式ID和与该模式相对应的一条消息.

The main difference between these formats is that Object Container Files carry the actual schema and may contain multiple messages corresponding to that schema, whereas the format described above carries only the schema id and exactly one message corresponding to that schema.

围绕对象容器文件编码的消息传递可能不太明显,因为一个Kafka消息将包含多个Avro消息.或者,您可以确保一个Kafka消息仅包含一个Avro消息,但这将导致在每个消息中携带模式.

Passing object-container-file-encoded messages around would probably be not obvious to follow/maintain because one Kafka message would then contain multiple Avro messages. Or you could ensure that one Kafka message contains only one Avro message but that would result in carrying schema with each message.

Avro模式可能非常大(我见过600 KB及以上的模式),并且随每条消息一起携带该模式确实会非常昂贵且浪费,因此这就是模式存储库的所在-模式仅获取一次,然后会被本地缓存,而所有其他查询只是快速的地图查询.

Avro schemas can be quite large (I've seen schemas like 600 KB and more) and carrying the schema with each message would be really costly and wasteful so that is where schema repository kicks in - the schema is fetched only once and gets cached locally and all other lookups are just map lookups that are fast.

这篇关于带有Avro和架构存储库的Apache Kafka-架构ID在消息中的何处?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆