How to mix record with map in Avro?
Question
I'm dealing with server logs in JSON format, and I want to store them on AWS S3 in Parquet format (and Parquet requires an Avro schema). First, all logs have a common set of fields. Second, all logs have a lot of optional fields which are not in the common set.
For example, the following are three logs:
{ "ip": "172.18.80.109", "timestamp": "2015-09-17T23:00:18.313Z", "message":"blahblahblah"}
{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", "message":"blahblahblah", "microseconds": 223}
{ "ip": "172.18.80.113", "timestamp": "2015-09-17T23:00:08.299Z", "message":"blahblahblah", "thread":"http-apr-8080-exec-1147"}
All three logs share three fields: ip, timestamp and message. Some of the logs have additional fields, such as microseconds and thread.
If I use the following schema, I will lose all additional fields:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"}
]
}
And the following schema works fine:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"name": "microseconds", "type": ["null", "long"]},
{"name": "thread", "type": ["null", "string"]}
]
}
But the only problem is that I don't know the names of all the optional fields unless I scan all the logs. Besides, new additional fields will appear in the future.
Then I came up with an idea that combines record and map:
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"type": "map", "values": "string"} // error
]
}
Unfortunately this won't compile:
java -jar avro-tools-1.7.7.jar compile schema example.avro .
It throws an error:
Exception in thread "main" org.apache.avro.SchemaParseException: No field name: {"type":"map","values":"long"}
at org.apache.avro.Schema.getRequiredText(Schema.java:1305)
at org.apache.avro.Schema.parse(Schema.java:1192)
at org.apache.avro.Schema$Parser.parse(Schema.java:965)
at org.apache.avro.Schema$Parser.parse(Schema.java:932)
at org.apache.avro.tool.SpecificCompilerTool.run(SpecificCompilerTool.java:73)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
Is there a way to store JSON strings in Avro format that is flexible enough to handle unknown optional fields?
Basically this is a schema evolution problem. Spark can deal with it through Schema Merging, but I'm looking for a solution with Hadoop.
Recommended answer
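The map type is a "complex" type in Avro terminology, so it cannot stand as a field on its own. It has to be given as the type of a named field, which is why the parser complained about "No field name". The snippet below works: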
{"namespace": "example.avro",
"type": "record",
"name": "Log",
"fields": [
{"name": "ip", "type": "string"},
{"name": "timestamp", "type": "string"},
{"name": "message", "type": "string"},
{"name": "additional", "type": {"type": "map", "values": "string"}}
]
}
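With that schema, each JSON log line still has to be reshaped so that the unknown keys end up inside the "additional" map. A minimal sketch of that step in Python, using only the standard library, might look like the following. The helper name to_log_record is hypothetical, and converting non-string values (such as microseconds) to strings is an assumption made here because the map's values are declared as "string".

```python
import json

# Field names shared by every log line (taken from the schema above).
COMMON_FIELDS = ("ip", "timestamp", "message")


def to_log_record(line):
    """Split one JSON log line into the common fields plus an 'additional' map.

    Hypothetical helper: values of extra fields are converted to strings
    because the schema declares the map's values as "string".
    """
    raw = json.loads(line)
    # Copy the three required fields as-is; a missing one raises KeyError.
    record = {name: raw[name] for name in COMMON_FIELDS}
    # Everything else goes into the map, stringified when it is not a string.
    record["additional"] = {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in raw.items()
        if key not in COMMON_FIELDS
    }
    return record


line = ('{ "ip": "172.18.80.112", "timestamp": "2015-09-17T23:00:08.297Z", '
        '"message":"blahblahblah", "microseconds": 223}')
record = to_log_record(line)
print(record["additional"])  # {'microseconds': '223'}
```

The resulting dictionary has the same shape as the Log record, so it can be handed to whichever Avro or Parquet writer you use. The trade-off is that every optional value is stored as a string, so numeric fields have to be parsed again when they are read back.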