Complex Json schema into custom spark dataframe
Question
Ok so I'm getting a big Json string from an API call, and I want to save some of that string into Cassandra. I'm trying to parse the Json string into a more table-like structure, but with only some fields. The overall schema looks like this:
And I want my table structure to use the regnum, date and value fields.

With sqlContext.read.json(vals).select(explode('register) as 'reg).select("reg.@attributes.regnum", "reg.data.date", "reg.data.value").show

I can get a table like this:
But as you can see date and value fields are arrays. I would like to have one element per record, and duplicate the corresponding regnum for each record. Any help is very much appreciated.
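The flattening being asked for can be sketched on plain Scala collections, independent of Spark (the regnum and array contents below are made-up sample values, not from the actual API response):

```scala
// Hypothetical sample: one register entry with its parallel date/value arrays
val regnum = 1234L
val dates  = Array("2017-01-01", "2017-01-02")
val values = Array("0.5", "0.7")

// zip pairs the two arrays element-wise; map prepends the shared regnum,
// yielding one (regnum, date, value) row per array element
val rows = dates.zip(values).map { case (d, v) => (regnum, d, v) }
// rows: Array((1234, "2017-01-01", "0.5"), (1234, "2017-01-02", "0.7"))
```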
Answer
You can cast your DataFrame to a Dataset and then flatMap over it.
// Cast each row to a typed tuple, then zip the two parallel arrays
// element-wise, repeating the regnum for every (date, value) pair
df.select("reg.@attributes.regnum", "reg.data.date", "reg.data.value")
  .as[(Long, Array[String], Array[String])]
  .flatMap(s => s._2.zip(s._3).map(p => (s._1, p._1, p._2)))
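Note that `.as[...]` needs tuple encoders in scope. A minimal sketch of the surrounding setup, reusing the `sqlContext` and `vals` names from the question (the element types in the `.as[...]` cast are an assumption about the JSON schema and may need adjusting):

```scala
import org.apache.spark.sql.functions.explode
import sqlContext.implicits._  // enables 'symbol column syntax and .as[...] encoders

val df = sqlContext.read.json(vals)          // vals: the JSON source from the question
  .select(explode('register) as "reg")       // one row per register entry

val flat = df
  .select("reg.@attributes.regnum", "reg.data.date", "reg.data.value")
  .as[(Long, Array[String], Array[String])]
  .flatMap(s => s._2.zip(s._3).map(p => (s._1, p._1, p._2)))
  .toDF("regnum", "date", "value")           // named columns, ready to write to Cassandra
```

Zipping truncates to the shorter of the two arrays, so if `date` and `value` can have different lengths you may want to check for that before flattening.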