AWS Athena export array of structs to JSON
Problem description
I've got an Athena table where some fields have a fairly complex nested format. The backing records in S3 are JSON. Along these lines (but we have several more levels of nesting):
CREATE EXTERNAL TABLE IF NOT EXISTS test (
    timestamp double,
    stats array<struct<time:double, mean:double, var:double>>,
    dets array<struct<coords:array<double>, header:struct<frame:int, seq:int, name:string>>>,
    pos struct<x:double, y:double, theta:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json'='true')
LOCATION 's3://test-bucket/test-folder/'
Now we need to be able to query the data and import the results into Python for analysis. Because of security restrictions I can't connect directly to Athena; I need to be able to give someone the query and then they will give me the CSV results.
If we just do a straight select * we get back the struct/array columns in a format that isn't quite JSON. Here's a sample input file entry:
{"timestamp":1520640777.666096,"stats":[{"time":15,"mean":45.23,"var":0.31},{"time":19,"mean":17.315,"var":2.612}],"dets":[{"coords":[2.4,1.7,0.3], "header":{"frame":1,"seq":1,"name":"hello"}}],"pos": {"x":5,"y":1.4,"theta":0.04}}
And example output:
select * from test
"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
I was hoping to get those nested fields exported in a more convenient format - getting them in JSON would be great.
Unfortunately it seems that cast to JSON only works for maps, not structs, because it just flattens everything into arrays:
SELECT timestamp, cast(stats as JSON) as stats, cast(dets as JSON) as dets, cast(pos as JSON) as pos FROM "sampledb"."test"
"timestamp","stats","dets","pos"
"1.520640777666096E9","[[15.0,45.23,0.31],[19.0,17.315,2.612]]","[[[2.4,1.7,0.3],[1,1,""hello""]]]","[5.0,1.4,0.04]"
Is there a good way to convert to JSON (or another easy-to-import format) or should I just go ahead and do a custom parsing function?
I have skimmed through all the documentation and, unfortunately, there seems to be no way to do this as of now. The only possible workaround is the one from "converting a struct to a json when querying athena", which selects each nested field as its own column:
SELECT
    my_field,
    my_field.a,
    my_field.b,
    my_field.c.d,
    my_field.c.e
FROM
    my_table
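A related option stays with the `CAST(... AS JSON)` query from the question. Its sample output keeps struct values in declaration order, so the field names can be re-attached in Python. This is only a sketch: `attach_names` and the schema notation (a dict for a struct, a one-element list for an array, `None` for a scalar) are hypothetical, and it assumes the order always matches the table definition.

```python
import json


def attach_names(value, schema):
    """Rebuild named structures from CAST(... AS JSON) output.

    schema is a hypothetical notation: a dict maps field names to
    sub-schemas (struct), a one-element list wraps the element schema
    (array), and None marks a scalar.
    """
    if schema is None:
        return value
    if isinstance(schema, list):
        return [attach_names(item, schema[0]) for item in value]
    # Struct: assume values arrive in the order the fields were declared.
    return {
        name: attach_names(item, sub)
        for (name, sub), item in zip(schema.items(), value)
    }


# Schemas transcribed by hand from the CREATE TABLE statement in the question.
STATS = [{"time": None, "mean": None, "var": None}]
DETS = [{"coords": [None], "header": {"frame": None, "seq": None, "name": None}}]
POS = {"x": None, "y": None, "theta": None}

# Cell values copied from the CAST(... AS JSON) sample output in the question.
stats = attach_names(json.loads("[[15.0,45.23,0.31],[19.0,17.315,2.612]]"), STATS)
dets = attach_names(json.loads('[[[2.4,1.7,0.3],[1,1,"hello"]]]'), DETS)
pos = attach_names(json.loads("[5.0,1.4,0.04]"), POS)

print(json.dumps({"stats": stats, "dets": dets, "pos": pos}))
```

The schema has to be kept in sync with the table by hand, which is the main drawback of this approach.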
Alternatively, I would convert the data to JSON in a post-processing step. The script below shows how:
#!/usr/bin/env python
import io
import re

# Quote struct keys: `{time=` or `, mean=` becomes `{"time":` or `, "mean":`.
pattern1 = re.compile(r'(?<=[{ ])([a-z_]\w*)=', re.I)
# Quote bare string values such as `:hello`; numbers are left untouched.
pattern2 = re.compile(r':([a-z][^,{}. [\]]*)', re.I)

with io.open("test.csv") as f:
    # Header cells keep their surrounding quotes, so they already work as JSON keys.
    headers = [h.strip() for h in f.readline().split(",")]
    for line in f:
        line = line.strip()  # drop the trailing newline before removing the quotes
        if not line:
            continue
        data = []
        for i, cell in enumerate(line.split('","')):
            data.append(headers[i] + ":" + re.sub('^"|"$', "", cell))
        line = "{" + ",".join(data) + "}"
        line = pattern1.sub(r'"\1":', line)
        line = pattern2.sub(r':"\1"', line)
        print(line)
The output on your input data is:
{"timestamp":1.520640777666096E9,"stats":[{"time":15.0, "mean":45.23, "var":0.31}, {"time":19.0, "mean":17.315, "var":2.612}],"dets":[{"coords":[2.4, 1.7, 0.3], "header":{"frame":1, "seq":1, "name":"hello"}}],"pos":{"x":5.0, "y":1.4, "theta":0.04}}
which is valid JSON.
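Since the question mentions a custom parsing function, here is a minimal sketch of one as an alternative to the regular expressions. It assumes the cells follow the `{key=value, ...}` / `[...]` rendering shown in the question and that string values contain no `,`, `=`, `}` or `]`. The names `parse_athena`, `_parse_value`, `_parse_struct`, `_parse_array` and `_scalar` are hypothetical, not part of any Athena tooling.

```python
import csv
import io
import json


def parse_athena(text):
    """Parse Athena's '{k=v, ...}' / '[...]' text rendering into Python objects."""
    text = text.strip()
    if not text:
        return None
    value, _ = _parse_value(text, 0)
    return value


def _parse_value(s, i):
    while i < len(s) and s[i] == " ":  # skip the space that follows a comma
        i += 1
    if s[i] == "{":
        return _parse_struct(s, i + 1)
    if s[i] == "[":
        return _parse_array(s, i + 1)
    j = i
    while j < len(s) and s[j] not in ",}]":  # a scalar ends at the next delimiter
        j += 1
    return _scalar(s[i:j].strip()), j


def _parse_struct(s, i):
    result = {}
    while s[i] != "}":
        eq = s.index("=", i)
        key = s[i:eq].strip()
        result[key], i = _parse_value(s, eq + 1)
        if s[i] == ",":
            i += 1
    return result, i + 1


def _parse_array(s, i):
    items = []
    while s[i] != "]":
        value, i = _parse_value(s, i)
        items.append(value)
        if s[i] == ",":
            i += 1
    return items, i + 1


def _scalar(token):
    if token == "null":
        return None
    if token in ("true", "false"):
        return token == "true"
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token  # anything else is kept as a string


# Header and row copied from the `select * from test` output in the question.
sample = '''"timestamp","stats","dets","pos"
"1.520640777666096E9","[{time=15.0, mean=45.23, var=0.31}, {time=19.0, mean=17.315, var=2.612}]","[{coords=[2.4, 1.7, 0.3], header={frame=1, seq=1, name=hello}}]","{x=5.0, y=1.4, theta=0.04}"
'''

rows = [
    {column: parse_athena(cell) for column, cell in row.items()}
    for row in csv.DictReader(io.StringIO(sample))
]
print(json.dumps(rows[0]))
```

Using `csv.DictReader` leaves the quoting and the embedded commas to the standard library, so only the struct syntax needs custom handling. The text format is ambiguous by nature: a string value that looks like a number (or like `nan` or `inf`) comes back as a number, so columns with free-form strings need extra care.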