Create Athena table with column as unstructured JSON from S3


Problem description

I am currently creating an Athena table as follows:

CREATE EXTERNAL TABLE `foo_streaming`(
  `type` string, 
  `message` struct<a:string,b:string,c:string>)
PARTITIONED BY ( 
  `dt` string)
ROW FORMAT SERDE 
  'org.apache.hive.hcatalog.data.JsonSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foo/data'

However, instead of treating the message struct as structured data, I would like to read it as a JSON blob, because the data could change at any point. How do I do this with Athena?

I tried the following, but it gives me an error. I tried googling, but found nothing.

CREATE EXTERNAL TABLE `foo_streaming`(
  `type` string,
  `message` JSON)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foo/data'

Sample data from S3 would look like:

{ "type": "GTF", "message": { "a": 1, "b": 2 } }
{ "type": "GTB", "message": { "c": 1, "d": 2, "x": { "testid": "abc" } } }
{ "type": "GTE", "message": { "error_code": 1 } }

Recommended answer

Use string as the column type, and then use Athena/Presto's JSON functions to extract values from the blobs.
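A minimal sketch of what that could look like for the table in the question, assuming the SerDe hands the nested message object back as its raw JSON text when the column is declared as string. The partition value '2020-01-01' is made up for illustration, and the JSON paths are taken from the sample data. Athena runs one statement per query, so the two statements would be submitted separately.

```sql
-- DDL: same table as in the question, but `message` is a plain string.
CREATE EXTERNAL TABLE `foo_streaming`(
  `type` string,
  `message` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foo/data';

-- Query: pull individual values out of the blob at read time.
-- The dt value below is a hypothetical partition.
SELECT
  type,
  json_extract_scalar(message, '$.error_code') AS error_code,
  json_extract_scalar(message, '$.x.testid')   AS testid
FROM foo_streaming
WHERE dt = '2020-01-01';
```

json_extract_scalar returns a varchar, and NULL when the path is absent from a given row, so rows with different message shapes can share one query. Cast the result if a numeric type is needed.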

You can see this solution in action in the documentation for how to query CloudTrail logs with Athena. The requestparameters and responseelements properties are JSON, but their shape depends on the service, so they can't be described with a struct.
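As a rough illustration of the same pattern on CloudTrail data, the query below assumes a table named cloudtrail_logs (a hypothetical name; use whatever your own table is called) in which requestparameters is declared as string. The bucketName key is what S3 events typically carry in their request parameters; other services use different keys.

```sql
-- Assumes a CloudTrail table called cloudtrail_logs (hypothetical name)
-- whose requestparameters column is a string holding JSON text.
SELECT
  eventname,
  json_extract_scalar(requestparameters, '$.bucketName') AS bucket_name
FROM cloudtrail_logs
WHERE eventsource = 's3.amazonaws.com'
LIMIT 10;
```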

json doesn't work as a type because it's not a type recognised by Hive; it's a Presto thing, IIRC. I find it pretty confusing in general which types are supported, and which are called different things depending on whether it's DDL or DML (e.g. string and varchar).
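One way to see the DDL/DML naming difference for yourself, assuming message has been declared as string in the table definition as suggested above. This is only a sketch: typeof and json_parse are Presto functions, and the first column should report the query-side name of the Hive string type while the second shows the json type that exists only on the query side.

```sql
-- Assumes `message` is declared as string in the DDL.
SELECT
  typeof(message)             AS ddl_string_seen_by_query,
  typeof(json_parse(message)) AS parsed_type
FROM foo_streaming
LIMIT 1;
```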
