Create Athena table with column as unstructured JSON from S3
Question
I am currently creating an Athena table as follows:
CREATE EXTERNAL TABLE `foo_streaming`(
`type` string,
`message` struct<a:string,b:string,c:string>)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://foo/data'
However, instead of treating the message struct as structured data, I would like to read it as a JSON blob, because the data could change at any point. How do I do this with Athena?
I tried the following, but it gives me an error. I tried googling, but found nothing.
CREATE EXTERNAL TABLE `foo_streaming`(
`type` string,
`message` JSON)
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://foo/data'
Sample data from S3 would look like:
{ "type": "GTF", "message": { "a": 1, "b": 2 } }
{ "type": "GTB", "message": { "c": 1, "d": 2, "x": { "testid": "abc" } } }
{ "type": "GTE", "message": { "error_code": 1 } }
Recommended answer
Use string as the type, and then use Athena/Presto's JSON functions to extract values from the blobs.
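A minimal sketch of how this could look for the table in the question. The DDL is the question's own statement with only the message column changed to string; the query uses Presto's json_extract_scalar and json_extract functions. The partition value '2020-01-01' is a made-up placeholder, and how a nested object is handed to a string column can depend on the SerDe, so treat this as an untested illustration rather than a verified setup.

```sql
-- Same DDL as in the question, but `message` is declared as a plain string
-- so that its contents are not tied to a fixed struct.
CREATE EXTERNAL TABLE `foo_streaming`(
  `type` string,
  `message` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://foo/data';

-- Pull individual values out of the blob at query time.
-- json_extract_scalar returns a varchar (NULL when the path is absent);
-- json_extract returns a value of Presto's json type.
SELECT
  type,
  json_extract_scalar(message, '$.a')          AS a,
  json_extract_scalar(message, '$.error_code') AS error_code,
  json_extract_scalar(message, '$.x.testid')   AS testid,
  json_extract(message, '$.x')                 AS x
FROM foo_streaming
WHERE dt = '2020-01-01';  -- hypothetical partition value
```

Rows whose message lacks a given key simply yield NULL for that column, which is what makes this approach tolerant of a changing payload.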
You can see this solution in action in the documentation for how to query CloudTrail logs with Athena. The requestparameters and responseelements properties are JSON, but are service-dependent, and therefore can't be described with a struct.
json doesn't work as a type because it's not a type recognised by Hive; it's a Presto thing, IIRC. I find it pretty confusing in general when types are supported or not, or called different things depending on whether it's DDL or DML (e.g. string and varchar).
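To illustrate the naming difference, here is a small sketch that assumes the message column was declared as string in the DDL. At query time the same column is expected to be reported as varchar, while json only shows up as the result of functions such as json_parse.

```sql
-- Inspect the query-time (Presto) types of a column declared as `string` in DDL.
SELECT
  typeof(message)                             AS column_type,  -- DDL `string` is expected to appear as varchar here
  typeof(json_parse(message))                 AS parsed_type,  -- json is a query-time type, not a DDL type
  json_format(json_extract(message, '$.x'))   AS x_as_text     -- serialize a json value back to varchar
FROM foo_streaming
LIMIT 10;
```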