在 AWS Athena 中提取数组中的 json [英] extract json in array in AWS Athena

查看:30
本文介绍了在 AWS Athena 中提取数组中的 json的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已将日志从 kubernetes 发送到 S3 存储桶,并想使用 Athena 进行查询

I have sent logs from kubernetes to an S3 bucket and want to query it using Athena

日志看起来像这样

[{      "date":1589895855.077230,
      "log":"192.168.85.35 - - [19/May/2020:13:44:15 +0000] "GET /healthz HTTP/1.1" 200 3284 "-" "ELB-HealthChecker/2.0" "-"",
      "stream":"stdout",
      "time":"2020-05-19T13:44:15.077230187Z",
      "kubernetes":{
         "pod_name":"myapp-deployment-cd984ffb-kjfbm",
         "namespace_name":"master",
         "pod_id":"eace0175-99cd-11ea-95e4-0aee746ae5d6",
         "labels":{
            "app":"myapp",
            "pod-template-hash":"cd984ffb"
          },
         "annotations":{
            "cluster-autoscaler.kubernetes.io/safe-to-evict":"false",
            "kubernetes.io/psp":"eks.privileged"
          },
         "host":"ip-1-1-1-1.eu-north-1.compute.internal",
         "container_name":"myapp",
         "docker_id":"cb2cda1ed46c5f09d15090fc3f654b1de35970001e366923287cfbd4a4abf4a1"
      }
},
{      "date":1589995860.077230,
      "log":"192.168.1.40 - - [20/May/2020:17:31:00 +0000] "GET /healthz HTTP/1.1" 200 3284 "-" "ELB-HealthChecker/2.0" "-"",
      "stream":"stdout",
      "time":"2020-05-20T17:31:00.077230187Z",
      "kubernetes":{
         "pod_name":"myapp-deployment-cd984ffb-kjfbm",
         "namespace_name":"master",
         "pod_id":"eace0175-99cd-11ea-95e4-0aee746ae5d6",
         "labels":{
            "app":"myapp",
            "pod-template-hash":"cd984ffb"
          },
         "annotations":{
            "cluster-autoscaler.kubernetes.io/safe-to-evict":"false",
            "kubernetes.io/psp":"eks.privileged"
          },
         "host":"ip-1-1-1-1.eu-north-1.compute.internal",
         "container_name":"myapp",
         "docker_id":"cb2cda1ed46c5f09d15090fc3f654b1de35970001e366923287cfbd4a4abf4a1"
      }
},]

所以基本上是一个带有 json 对象的数组.

So an array with json object in it basically.

我在 Athena 中使用 CREATE EXTERNAL TABLE 查询来创建表.我试过的是:

I am using an CREATE EXTERNAL TABLE query in Athena to create the table. What I have tried is:

CREATE EXTERNAL TABLE IF NOT EXISTS athenadb.mytable (
                   `data` string
                 )
                 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
                 LOCATION 's3://mybucket/testlog'
                 TBLPROPERTIES ('has_encrypted_data'='false');

这只会将数组中的第一项读入表中,除非我指定了更多行如

This only read the first item in the array into the table, unless I specified more rows such as

data1 字符串

data2 字符串

data3 字符串

然而,由于我不知道数组中有多少项,我需要一些更动态的东西.

However since I don't know how many items is in the array I need something more dynamic.

然后我尝试了这个

CREATE EXTERNAL TABLE IF NOT EXISTS athenadb.mytable (
                   `data` string
                 )
                 LOCATION 's3://mybucket/testlog'
                 TBLPROPERTIES ('has_encrypted_data'='false');

现在我在表格的一行中获得了整个日志(两个条目).

Now I get the entire log (both entries) in one row in the table.

从这里开始,我尝试使用 UNNEST,但出现无法取消嵌套类型:varchar"的错误

From here I have tried to use UNNEST but I get errors that "cannot unnest type: varchar"

将每个 {} 放入表中自己行的最简单方法是什么?也许从 CREATE EXTERNAL TABLE 完成之后不需要任何额外的查询?

What would be the simplest way to get each {} into its own row in the table? Maybe done from the CREATE EXTERNAL TABLE without needing any extra queries afterwards?

现在也试过了

SELECT data
FROM mytable
CROSS JOIN UNNEST(CAST(json_parse(data) AS array)) AS data2

但我得到未知类型:数组"

But I get "Unknown type: array"

我在这里发现了一个类似的问题:如何将数据数组导入到 hive 表中的单独行中?

I found a similar question here: How do I import an array of data into separate rows in a hive table?

但似乎没有任何建议的解决方案可以产生想要的结果.

But there didn't seem to be any suggested solution that created the wanted result.

推荐答案

Combine unnest 使用 castingjsonarray(json):

Combine unnest with casting json to array(json):

SELECT data, e
FROM mytable
CROSS JOIN UNNEST(CAST(json_parse(data) AS array(json))) t(e)

注意:arrayarray(json) 类型定义的旧版本.后者符合 SQL 标准.

Note: array<json> is a legacy version of array(json) type definition. The latter is SQL standard compliant.

这篇关于在 AWS Athena 中提取数组中的 json的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆