Multi-line JSON file querying in Hive
Question
I understand that the majority of JSON SerDe formats expect .json
files to be stored with one record per line.
I have an S3 bucket with multi-line indented .json
files (I don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
- Is there a SerDe format out there that is able to parse multi-line indented .json files?
- If there isn't a SerDe format to do this:
  - Is there a best practice for dealing with files like this?
  - Should I plan on flattening these records out using a different tool like python?
Example file body:
[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]
Recommended answer
Unfortunately, there is no serde that supports multi-line JSON content. There is a specialized CloudTrail serde that supports a format similar to yours, but it is hard-coded for the CloudTrail JSON format; still, it shows that this is at least theoretically possible. Currently there is no way to write your own serdes to use with Athena, though.
You won't be able to consume these files with Athena; you will have to use EMR, Glue, or some other tool to reformat them into JSON stream files first.
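To illustrate the reformatting step, here is a minimal Python sketch that turns a file body like the one in the question (a single indented JSON array) into one compact record per line. The function name array_to_json_lines and the second sample record are made up for illustration, and the sketch assumes each file fits in memory; very large files would need a streaming parser instead.

```python
import json


def array_to_json_lines(text: str) -> str:
    """Convert a JSON document holding an array of records into
    newline-delimited JSON (one compact record per line)."""
    records = json.loads(text)
    # Assumption: a file containing a single top-level object is
    # treated as a one-record file.
    if not isinstance(records, list):
        records = [records]
    # Compact separators guarantee no newlines inside a record.
    return "".join(
        json.dumps(record, separators=(",", ":")) + "\n" for record in records
    )


# Sample modeled on the question's file body; the second record is hypothetical.
sample = """
[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [123, 456]
    }
  },
  {
    "id": 2,
    "name": "hypothetical second record",
    "stuff": {"x": false, "y": []}
  }
]
"""

json_lines = array_to_json_lines(sample)
print(json_lines, end="")
```

The output could then be written back to a separate S3 prefix and that prefix used as the table location, so the one-record-per-line JSON SerDes can read it. Running this conversion inside a Glue or EMR job, as the answer suggests, follows the same idea at larger scale.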