Why use Avro with Kafka - How to handle POJOs

Question

I have a Spring application that is my Kafka producer, and I was wondering why Avro is the best way to go. I have read about it and all it has to offer, but why can't I just serialize the POJO I created myself with Jackson, for example, and send it to Kafka?

I am asking because generating POJOs from Avro is not so straightforward. On top of that, it requires the Maven plugin and an .avsc file.

For example, here is a POJO called User that I created myself in my Kafka producer:

public class User {

    private long    userId;

    private String  name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getUserId() {
        return userId;
    }

    public void setUserId(long userId) {
        this.userId = userId;
    }

}

I serialize it and send it to my user topic in Kafka. Then I have a consumer that has its own User POJO and deserializes the message. Is it a matter of space? Is it not also faster to serialize and deserialize this way? Not to mention the overhead of maintaining a Schema Registry.
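
To make the question concrete, here is a minimal sketch of that approach, assuming a plain KafkaProducer and Jackson's ObjectMapper: the User POJO is turned into JSON bytes and sent to a user topic. The broker address, topic name, and class name are illustrative assumptions, not part of the original question.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonUserProducer {

    public static void main(String[] args) throws Exception {
        // Broker address and topic name are illustrative placeholders.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        User user = new User();
        user.setUserId(42L);
        user.setName("alice");

        // Jackson writes the POJO as JSON bytes; the consumer must already know the shape.
        byte[] value = new ObjectMapper().writeValueAsBytes(user);

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user", String.valueOf(user.getUserId()), value));
        }
    }
}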

Answer

You don't need an AVSC file; you can use an AVDL file, which basically looks the same as a POJO with only the fields:

@namespace("com.example.mycode.avro")
protocol ExampleProtocol {
   record User {
     long id;
     string name;
   }
}

When you use the idl-protocol goal of the Avro Maven plugin, it will create this AVSC for you, rather than you writing it yourself:

{
  "type" : "record",
  "name" : "User",
  "namespace" : "com.example.mycode.avro",
  "fields" : [ {
    "name" : "id",
    "type" : "long"
  }, {
    "name" : "name",
    "type" : "string"
  } ]
}

It will also place a SpecificData POJO, User.java, on your classpath for use in your code.
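
As an illustration (not code from the original answer), a producer that sends the generated com.example.mycode.avro.User with Confluent's KafkaAvroSerializer could look roughly like this; the broker address, Schema Registry URL, and topic name are assumptions:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import com.example.mycode.avro.User;

public class AvroUserProducer {

    public static void main(String[] args) {
        // Broker and Schema Registry addresses are illustrative placeholders.
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Confluent's serializer registers/looks up the schema and writes compact Avro binary.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // The generated SpecificRecord comes with a builder for its fields.
        User user = User.newBuilder()
                .setId(42L)
                .setName("alice")
                .build();

        try (KafkaProducer<String, User> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user", String.valueOf(user.getId()), user));
        }
    }
}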

If you already have a POJO, you don't need to use AVSC or AVDL files; there are libraries to convert POJOs. For example, you can use Jackson, which is not only for JSON: you would likely just need to create a JacksonAvroSerializer for Kafka, or find out whether one already exists. A minimal sketch of that idea follows.
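
The sketch below assumes Jackson's avro dataformat module (AvroMapper); the class name JacksonAvroSerializer is hypothetical, not an existing library class, and the error handling is simplified.

import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;

// Hypothetical serializer name; a sketch of the idea, not an official class.
public class JacksonAvroSerializer implements Serializer<User> {

    private final AvroMapper mapper = new AvroMapper();
    private final AvroSchema schema;

    public JacksonAvroSerializer() {
        try {
            // Derive the Avro schema directly from the existing POJO; no .avsc or .avdl needed.
            schema = mapper.schemaFor(User.class);
        } catch (Exception e) {
            throw new IllegalStateException("Could not derive Avro schema for User", e);
        }
    }

    @Override
    public byte[] serialize(String topic, User data) {
        try {
            // Write the POJO as Avro binary using the derived schema.
            return mapper.writer(schema).writeValueAsBytes(data);
        } catch (Exception e) {
            throw new SerializationException("Failed to serialize User as Avro", e);
        }
    }
}

You would then register this class as the value.serializer of your producer, just like any other Kafka Serializer implementation.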

That brings us to the question of Avro itself - why Avro (for Kafka)?

Well, having a schema is a good thing. Think of RDBMS tables: you can explain the table and see all of its columns. Move to NoSQL document databases, and they can contain literally anything; that is the JSON world of Kafka.

Let's assume you have consumers in your Kafka cluster that have no idea what is in a topic; they have to know exactly who produced what into it. They can try the console consumer, and if it is plaintext such as JSON, they have to figure out which fields they are interested in, then perform flaky HashMap-like .get("name") operations again and again, only to run into an NPE when a field doesn't exist. With Avro, you clearly define defaults and nullable fields.

You aren't required to use a Schema Registry, but it provides the equivalent of those explain topic semantics from the RDBMS analogy. It also saves you from having to send the schema along with every message, and from the expense of extra bandwidth on the Kafka topic. The registry is not only useful for Kafka, though; it can also be used by Spark, Flink, Hive, etc. for all the data-science analysis around streaming data ingestion.
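
For instance, a consumer wired to the registry might look roughly like the sketch below (my illustration, assuming Confluent's KafkaAvroDeserializer and the generated User class; the URLs, group id, and topic name are placeholders). The deserializer fetches the schema by id from the registry, so the schema itself never travels inside each message.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import com.example.mycode.avro.User;

public class AvroUserConsumer {

    public static void main(String[] args) {
        // Broker/registry addresses, group id, and topic name are illustrative placeholders.
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "user-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Deserialize into the generated SpecificRecord rather than a GenericRecord.
        props.put("specific.avro.reader", "true");

        try (KafkaConsumer<String, User> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user"));
            ConsumerRecords<String, User> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, User> record : records) {
                // Typed access to fields, instead of flaky map-style lookups.
                System.out.println(record.value().getName());
            }
        }
    }
}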

Assuming you do want to stick with JSON, try using MsgPack instead; you will likely see an increase in your Kafka throughput and save disk space on the brokers.

You can also use other formats such as Protobuf or Thrift, as Uber has compared.
