在AWS Athena中提取数组中的json [英] extract json in array in AWS Athena
问题描述
我已将kubernetes的日志发送到S3存储桶,并希望使用Athena进行查询
I have sent logs from kubernetes to an S3 bucket and want to query it using Athena
日志看起来像这样
[{ "date":1589895855.077230,
"log":"192.168.85.35 - - [19/May/2020:13:44:15 +0000] \"GET /healthz HTTP/1.1\" 200 3284 \"-\" \"ELB-HealthChecker/2.0\" \"-\"",
"stream":"stdout",
"time":"2020-05-19T13:44:15.077230187Z",
"kubernetes":{
"pod_name":"myapp-deployment-cd984ffb-kjfbm",
"namespace_name":"master",
"pod_id":"eace0175-99cd-11ea-95e4-0aee746ae5d6",
"labels":{
"app":"myapp",
"pod-template-hash":"cd984ffb"
},
"annotations":{
"cluster-autoscaler.kubernetes.io/safe-to-evict":"false",
"kubernetes.io/psp":"eks.privileged"
},
"host":"ip-1-1-1-1.eu-north-1.compute.internal",
"container_name":"myapp",
"docker_id":"cb2cda1ed46c5f09d15090fc3f654b1de35970001e366923287cfbd4a4abf4a1"
}
},
{ "date":1589995860.077230,
"log":"192.168.1.40 - - [20/May/2020:17:31:00 +0000] \"GET /healthz HTTP/1.1\" 200 3284 \"-\" \"ELB-HealthChecker/2.0\" \"-\"",
"stream":"stdout",
"time":"2020-05-20T17:31:00.077230187Z",
"kubernetes":{
"pod_name":"myapp-deployment-cd984ffb-kjfbm",
"namespace_name":"master",
"pod_id":"eace0175-99cd-11ea-95e4-0aee746ae5d6",
"labels":{
"app":"myapp",
"pod-template-hash":"cd984ffb"
},
"annotations":{
"cluster-autoscaler.kubernetes.io/safe-to-evict":"false",
"kubernetes.io/psp":"eks.privileged"
},
"host":"ip-1-1-1-1.eu-north-1.compute.internal",
"container_name":"myapp",
"docker_id":"cb2cda1ed46c5f09d15090fc3f654b1de35970001e366923287cfbd4a4abf4a1"
}
},]
所以
我正在Athena中使用CREATE EXTERNAL TABLE查询来创建表。
我尝试过的是:
I am using an CREATE EXTERNAL TABLE query in Athena to create the table. What I have tried is:
CREATE EXTERNAL TABLE IF NOT EXISTS athenadb.mytable (
`data` string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/testlog'
TBLPROPERTIES ('has_encrypted_data'='false');
这仅将数组中的第一项读入表中,除非我指定了更多行,例如
This only read the first item in the array into the table, unless I specified more rows such as
data1字符串
data2字符串
data3字符串
但是,由于我不知道数组中有多少个项目,所以我需要更动态的东西。
However since I don't know how many items is in the array I need something more dynamic.
然后我尝试了这个
CREATE EXTERNAL TABLE IF NOT EXISTS athenadb.mytable (
`data` string
)
LOCATION 's3://mybucket/testlog'
TBLPROPERTIES ('has_encrypted_data'='false');
现在我将整个日志(两个条目)都放在表的一行中。
Now I get the entire log (both entries) in one row in the table.
在这里,我尝试使用UNNEST,但是出现无法取消嵌套类型:varchar的错误
From here I have tried to use UNNEST but I get errors that "cannot unnest type: varchar"
将每个{}放入表中自己的行的最简单方法?也许可以从CREATE EXTERNAL TABLE中完成,而之后不需要任何额外的查询?
What would be the simplest way to get each {} into its own row in the table? Maybe done from the CREATE EXTERNAL TABLE without needing any extra queries afterwards?
编辑:
现在
SELECT data
FROM mytable
CROSS JOIN UNNEST(CAST(json_parse(data) AS array)) AS data2
但是我得到了未知类型:数组
But I get "Unknown type: array"
我在这里找到了类似的问题:如何将数据数组导入到蜂巢表的单独行中?
I found a similar question here: How do I import an array of data into separate rows in a hive table?
但是没有
推荐答案
合并 unnest
与将 json
广播到 array(json)
:
Combine unnest
with casting json
to array(json)
:
SELECT data, e
FROM mytable
CROSS JOIN UNNEST(CAST(json_parse(data) AS array(json))) t(e)
注意: array< json>
是 array(json)
类型定义的旧版本。后者符合SQL标准。
Note: array<json>
is a legacy version of array(json)
type definition. The latter is SQL standard compliant.
这篇关于在AWS Athena中提取数组中的json的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!