How to mix record with map in Avro?

Views: 34 | Posted: 2021/4/12 20:51:57 | Tag: avro


Problem description

I'm dealing with server logs in JSON format, and I want to store them on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs share a common set of fields. Second, all logs have many optional fields that are not in the common set.

For example, the following are three logs:

{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}

All three logs have 3 shared fields: ip, timestamp and message. Some of the logs have additional fields, such as microseconds and thread.

If I use the following schema, I will lose all the additional fields:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"}
 ]
}

And the following schema works fine:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"},
     {"name": "microseconds", "type": ["null", "long"]},
     {"name": "thread", "type": ["null", "string"]}
 ]
}

The only problem is that I don't know the names of all the optional fields unless I scan all the logs. Besides, new additional fields will appear in the future.
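
To make the "scan all the logs" step concrete, here is a minimal Python sketch, assuming the logs are available as one JSON object per line. The helper name split_field_names is hypothetical, and the input is just the three sample logs above. It also illustrates the limitation: the result only covers fields already seen, not fields added later.

```python
import json

# The three sample log lines from the question.
raw_logs = [
    '{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}',
    '{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}',
    '{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}',
]


def split_field_names(lines):
    """Return (common, optional) sets of field names for JSON log lines.

    Hypothetical helper: 'common' holds names present in every log,
    'optional' holds names present in at least one log but not all.
    Expects at least one line.
    """
    key_sets = [set(json.loads(line)) for line in lines]
    common = set.intersection(*key_sets)
    optional = set.union(*key_sets) - common
    return common, optional


common, optional = split_field_names(raw_logs)
print(sorted(common))    # ['ip', 'message', 'timestamp']
print(sorted(optional))  # ['microseconds', 'thread']
```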

Then I came up with an idea that combines record and map:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"},
     {"type": "map", "values": "string"}  // error
 ]
}

Unfortunately this won't compile:

java -jar avro-tools-1.7.7.jar compile schema example.avro .

It throws the following error:

Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
    at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
    at org.apache.avro.Schema.parse(Schema.java:1192)
    at org.apache.avro.Schema$Parser.parse(Schema.java:965)
    at org.apache.avro.Schema$Parser.parse(Schema.java:932)
    at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

Is there a way to store JSON strings in Avro format that is flexible enough to deal with unknown optional fields?

Basically this is a schema evolution problem. Spark can deal with it through Schema Merging, but I'm seeking a solution with Hadoop.
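
The "No field name" message above points at the map entry, which sits directly in the fields list without a name. One possible direction, sketched here in Python under stated assumptions, is to wrap the map in a named field. The field name extras, the constant COMMON_FIELDS and the helper to_log_record are all hypothetical, and the schema has not been run through avro-tools here. Because the map's values are declared as string, non-string values such as microseconds are serialized to text.

```python
import json

# Fields every log is assumed to carry (taken from the question).
COMMON_FIELDS = ("ip", "timestamp", "message")

# Same record as in the question, except that the map is the type of a
# named field. "extras" is a hypothetical name chosen for illustration.
LOG_SCHEMA = {
    "namespace": "example.avro",
    "type": "record",
    "name": "Log",
    "fields": [
        {"name": "ip", "type": "string"},
        {"name": "timestamp", "type": "string"},
        {"name": "message", "type": "string"},
        {"name": "extras", "type": {"type": "map", "values": "string"}},
    ],
}


def to_log_record(line):
    """Split one JSON log line into the common fields plus an 'extras' map.

    Values in 'extras' are converted to strings so they fit a map whose
    values are declared as "string".
    """
    log = json.loads(line)
    record = {name: log[name] for name in COMMON_FIELDS}
    record["extras"] = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in log.items()
        if key not in COMMON_FIELDS
    }
    return record


sample = '{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}'
record = to_log_record(sample)

print(json.dumps(LOG_SCHEMA, indent=1))  # schema text to save to a file
print(record["extras"])                  # {'microseconds': '223'}
```

The trade-off of this design is that every unknown field loses its original type and ends up as a string inside one column, so queries on those fields need a cast.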
