Why use Avro with Kafka - How to handle POJOs

Question

I have a Spring application that is my Kafka producer, and I was wondering why Avro is the best way to go. I read about it and all it has to offer, but why can't I just serialize the POJO I created myself with Jackson, for example, and send it to Kafka?

I'm saying this because the POJO generation from Avro is not so straightforward. On top of that, it requires the Maven plugin and an .avsc file.

So, for example, I have a POJO called User that I created myself on my Kafka producer:

public class User {

    private long    userId;

    private String  name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getUserId() {
        return userId;
    }

    public void setUserId(long userId) {
        this.userId = userId;
    }

}

I serialize it and send it to my user topic in Kafka. Then I have a consumer that has its own User POJO and deserializes the message. Is it a matter of space? Isn't it also faster to serialize and deserialize this way? Not to mention the overhead of maintaining a Schema Registry.
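
For context, here is a minimal sketch of the Jackson-based approach the question describes: a custom Kafka Serializer that writes the User POJO as JSON bytes. The class name and error handling are illustrative, not part of the original question.

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serializer;

public class UserJsonSerializer implements Serializer<User> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, User user) {
        try {
            // Jackson turns the POJO into UTF-8 JSON bytes; the consumer side
            // would use a matching Deserializer that calls mapper.readValue(...)
            return user == null ? null : mapper.writeValueAsBytes(user);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Failed to serialize User to JSON", e);
        }
    }
}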

Answer

You don't need an AVSC file; you can use an AVDL file, which basically looks the same as a POJO with only the fields:

@namespace("com.example.mycode.avro")
protocol ExampleProtocol {
   record User {
     long id;
     string name;
   }
}

When you use the idl-protocol goal of the Avro Maven plugin, it will create this AVSC for you, rather than you writing it yourself:

{
  "type" : "record",
  "name" : "User",
  "namespace" : "com.example.mycode.avro",
  "fields" : [ {
    "name" : "id",
    "type" : "long"
  }, {
    "name" : "name",
    "type" : "string"
  } ]
}

It will also place a SpecificData POJO, User.java, on your classpath for use in your code.
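
As a rough sketch of how the generated class could be used, assuming Confluent's KafkaAvroSerializer and a running Schema Registry (the topic name, broker address, and registry URL below are illustrative placeholders):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import io.confluent.kafka.serializers.KafkaAvroSerializer;

import com.example.mycode.avro.User;

public class UserAvroProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");

        // User here is the SpecificData class generated from the AVDL above
        User user = User.newBuilder().setId(1L).setName("alice").build();

        try (KafkaProducer<String, User> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user", String.valueOf(user.getId()), user));
        }
    }
}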

If you already have a POJO, you don't need to use AVSC or AVDL files. There are libraries to convert POJOs. For example, you can use Jackson, which is not only for JSON; you would likely just need to create a JacksonAvroSerializer for Kafka yourself, or find out if one already exists.
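
One possible sketch of that idea, using the jackson-dataformat-avro module to derive an Avro schema from the existing POJO; the class name and error handling are assumptions, not something the answer specifies:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonMappingException;
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import org.apache.kafka.common.serialization.Serializer;

public class JacksonAvroSerializer implements Serializer<User> {

    private final AvroMapper mapper = new AvroMapper();
    private final AvroSchema schema;

    public JacksonAvroSerializer() {
        try {
            // Jackson introspects the POJO's fields and builds an Avro schema from them
            this.schema = mapper.schemaFor(User.class);
        } catch (JsonMappingException e) {
            throw new IllegalStateException("Could not derive an Avro schema from User", e);
        }
    }

    @Override
    public byte[] serialize(String topic, User user) {
        try {
            // Writes Avro binary without embedding the schema, so the consumer
            // needs the same schema (or a registry) to read the message back
            return user == null ? null : mapper.writer(schema).writeValueAsBytes(user);
        } catch (JsonProcessingException e) {
            throw new RuntimeException("Failed to serialize User as Avro", e);
        }
    }
}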

Avro also has built-in libraries based on reflection.
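
A short sketch of the reflection-based approach, which derives the schema from the existing POJO at runtime, so no AVSC/AVDL file or code generation is needed (the helper class and method names are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

public class ReflectAvroExample {

    public static byte[] toAvroBytes(User user) throws IOException {
        // Derive the Avro schema from the POJO's fields via reflection
        Schema schema = ReflectData.get().getSchema(User.class);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);

        // Write the POJO as Avro binary using the reflected schema
        ReflectDatumWriter<User> writer = new ReflectDatumWriter<>(schema);
        writer.write(user, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}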

Now, to the question: why Avro (for Kafka)?

Well, having a schema is a good thing. Think about RDBMS tables: you can explain the table, and you see all the columns. Move to NoSQL document databases, and they can contain literally anything; that is the JSON world of Kafka.

Let's assume you have consumers in your Kafka cluster that have no idea what is in a topic; they have to know exactly who/what has been produced into it. They can try the console consumer, and if it is plaintext like JSON, they then have to figure out which fields they are interested in and perform flaky HashMap-like .get("name") operations again and again, only to run into an NPE when a field doesn't exist. With Avro, you clearly define defaults and nullable fields.

You aren't required to use a Schema Registry, but it provides that kind of explain topic semantics for the RDBMS analogy. It also saves you from needing to send the schema along with every message, and the expense of the extra bandwidth on the Kafka topic. The registry is not only useful for Kafka, though, as it can be used by Spark, Flink, Hive, etc. for all the data science analysis surrounding streaming data ingestion.

Assuming you did want to use JSON, then try using MsgPack instead and you will likely see an increase in your Kafka throughput and save disk space on the brokers.

You can also use other formats like Protobuf or Thrift, as Uber has compared.
