Multi-line JSON file querying in Hive


Question

I understand that the majority of JSON SerDe formats expect .json files to be stored with one record per line.

I have an S3 bucket with multi-line, indented .json files (I don't control the source) that I'd like to query using Amazon Athena (though I suppose this applies just as well to Hive generally).

  1. Is there a SerDe format out there that is able to parse multi-line indented .json files?
  2. If there isn't a SerDe format to do this:
    • Is there a best practice for dealing with files like this?
      • Should I plan on flattening these records out using a different tool like python?

Example file body:

[
  {
    "id": 1,
    "name": "ryan",
    "stuff": {
      "x": true,
      "y": [
        123,
        456
      ]
    }
  },
  ...
]

Recommended answer

Unfortunately, there is no SerDe that supports multi-line JSON content. The specialized CloudTrail SerDe supports a format similar to yours, but it is hard-coded for the CloudTrail JSON format; it does at least show that parsing such files is theoretically possible. Currently there is no way to write your own SerDes for use with Athena, though.

You won't be able to consume these files with Athena as they are; you will have to use EMR, Glue, or some other tool to reformat them into JSON stream files (one record per line) first.
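
Since the question mentions Python as a possible tool, here is a minimal sketch of that reformatting step, not a definitive implementation. It assumes each file holds a single top-level JSON array, as in the example above, and that the file is small enough to load into memory. The function name array_to_json_lines is hypothetical, and reading from and writing back to S3 is left out.

```python
import io
import json
from typing import TextIO


def array_to_json_lines(src: TextIO, dst: TextIO) -> int:
    """Read one top-level JSON array from src and write one record per line to dst.

    Returns the number of records written. Hypothetical helper for illustration.
    """
    records = json.load(src)  # assumption: the whole file fits in memory
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    for record in records:
        # Compact separators keep each record on a single line with no padding.
        dst.write(json.dumps(record, separators=(",", ":")))
        dst.write("\n")
    return len(records)


if __name__ == "__main__":
    # Small in-memory demonstration using a record shaped like the one in the question.
    indented = io.StringIO(
        '[\n  {\n    "id": 1,\n    "name": "ryan",\n'
        '    "stuff": {"x": true, "y": [123, 456]}\n  }\n]'
    )
    flattened = io.StringIO()
    count = array_to_json_lines(indented, flattened)
    print(count, "record(s) written")
    print(flattened.getvalue(), end="")
```

The output has one JSON object per line, which is the layout the JSON SerDes expect. For files too large to load at once, a streaming parser would be needed instead of json.load.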
