Multi-line JSON file querying in Hive
Problem description
I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.
I have an S3 bucket with multi-line indented .json files (I don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).
- Is there a SerDe format out there that is able to parse multi-line indented .json files?
- If there isn't a SerDe format to do this:
  - Is there a best practice for dealing with files like this?
  - Should I plan on flattening these records out using a different tool like Python?
  - Is there a standard way of writing custom SerDe formats, so I can write one myself?
Example file body:
[
    {
        "id": 1,
        "name": "ryan",
        "stuff": {
            "x": true,
            "y": [
                123,
                456
            ]
        }
    },
    ...
]
Recommended answer
Unfortunately, there is no SerDe that supports multi-line JSON content. The specialized CloudTrail SerDe supports a format similar to yours, but it is hard-coded for the CloudTrail JSON format. It does at least show that parsing such files is possible in principle. Currently, though, there is no way to write your own SerDe for use with Athena.
You won't be able to consume these files with Athena directly. You will first have to use EMR, Glue, or some other tool to reformat them into JSON stream files, with one record per line.
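Since the question mentions Python as a possible tool, here is a minimal sketch of that reformatting step using only the standard library. It assumes each file holds a single top-level JSON array (as in the example body) and is small enough to load into memory. The function name json_array_to_lines is hypothetical, and copying files to and from S3 is left out.

```python
import io
import json
from typing import TextIO


def json_array_to_lines(src: TextIO, dst: TextIO) -> int:
    """Read one JSON document from src and write one compact record per line to dst.

    Returns the number of records written. Assumes the whole document fits in memory.
    """
    records = json.load(src)
    if not isinstance(records, list):
        # Assumption: a single top-level object is treated as a one-record file.
        records = [records]

    count = 0
    for record in records:
        # Without an indent argument, json.dumps produces a single line;
        # newlines inside string values are escaped as \n.
        dst.write(json.dumps(record, separators=(",", ":"), ensure_ascii=False))
        dst.write("\n")
        count += 1
    return count


if __name__ == "__main__":
    # Small in-memory demo shaped like the example file body.
    sample = """
    [
        {
            "id": 1,
            "name": "ryan",
            "stuff": {
                "x": true,
                "y": [123, 456]
            }
        }
    ]
    """
    out = io.StringIO()
    json_array_to_lines(io.StringIO(sample), out)
    print(out.getvalue(), end="")
```

For real files, src and dst would be file objects opened with open(). The output then has one record per line, which is the layout the question says most JSON SerDe formats expect.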