How to mix record with map in Avro?

Question

I'm dealing with server logs in JSON format, and I want to store them on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs share a common set of fields. Second, all logs have a lot of optional fields that are not in the common set.

For example, the following are three logs:

{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}

All three logs have 3 shared fields: ip, timestamp and message. Some of the logs have additional fields, such as microseconds and thread.

If I use the following schema, I will lose all the additional fields:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"}
 ]
}

And the following schema works fine:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"},
     {"name": "microseconds", "type": ["null","long"]},
     {"name": "thread", "type": ["null","string"]}
 ]
}

The only problem is that I don't know the names of all the optional fields unless I scan all the logs. Besides, new additional fields will appear in the future.

Then I came up with an idea that combines record and map:

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
     {"name": "ip", "type": "string"},
     {"name": "timestamp",  "type": "string"},
     {"name": "message", "type": "string"},
     {"type": "map", "values": "string"}  // error
 ]
}

Unfortunately this won't compile:

java -jar avro-tools-1.7.7.jar compile schema example.avro .

It throws the following error:

Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
    at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
    at org.apache.avro.Schema.parse(Schema.java:1192)
    at org.apache.avro.Schema$Parser.parse(Schema.java:965)
    at org.apache.avro.Schema$Parser.parse(Schema.java:932)
    at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
    at org.apache.avro.tool.Main.run(Main.java:84)
    at org.apache.avro.tool.Main.main(Main.java:73)

Is there a way to store JSON strings in Avro format that is flexible enough to deal with unknown optional fields?

Basically this is a schema evolution problem. Spark can deal with it through Schema Merging, but I'm looking for a solution with Hadoop.

Recommended answer

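The map type is a "complex" type in Avro terminology. Every field of a record must have a name, so the map has to be declared as the type of a named field rather than placed directly in the fields list. The snippet below works: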

{"namespace": "example.avro",
 "type": "record",
 "name": "Log",
 "fields": [
   {"name": "ip", "type": "string"},
   {"name": "timestamp",  "type": "string"},
   {"name": "message", "type": "string"},
   {"name": "additional", "type": {"type": "map", "values": "string"}}
  ]
}
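
To feed a log into this schema, each JSON line has to be reshaped so that the three common fields stay at the top level and everything else goes into the "additional" map. The following is only a minimal sketch of that reshaping in plain Python (standard library only, no Avro library involved). The helper name to_log_record and the choice to turn non-string values into their JSON text are assumptions made for illustration, not part of the original answer.

```python
import json

# Field names taken from the "Log" schema above.
COMMON_FIELDS = ("ip", "timestamp", "message")


def to_log_record(line):
    """Split one JSON log line into the common fields plus an 'additional' map.

    Hypothetical helper: raises KeyError if a common field is missing.
    """
    raw = json.loads(line)
    record = {name: raw[name] for name in COMMON_FIELDS}
    # The map's values are declared as "string" in the schema, so any
    # non-string value (such as the numeric microseconds) is converted
    # to its JSON text representation.
    record["additional"] = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in raw.items()
        if key not in COMMON_FIELDS
    }
    return record


if __name__ == "__main__":
    line = ('{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", '
            '"message":"blahblahblah", "microseconds": 223}')
    print(to_log_record(line))
```

Because the map values are all strings, the original type of a field such as microseconds is not preserved. A reader who needs the number back has to parse it from the string.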