Complex JSON schema into a custom Spark DataFrame


Question


Ok so I'm getting a big JSON string from an API call, and I want to save some of it into Cassandra. I'm trying to parse the JSON string into a more table-like structure, but with only some fields. The overall schema looks like this:
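(Judging from the field paths queried below, the payload presumably has roughly this shape; the values here are hypothetical:)

```json
{
  "register": [
    {
      "@attributes": { "regnum": 1234 },
      "data": {
        "date":  ["2017-01-01", "2017-01-02"],
        "value": ["10", "20"]
      }
    }
  ]
}
```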


And I want my table structure to use the regnum, date and value fields. With

    sqlContext.read.json(vals)
      .select(explode('register) as 'reg)
      .select("reg.@attributes.regnum", "reg.data.date", "reg.data.value")
      .show

I can get a table like this:


But as you can see, the date and value fields are arrays. I would like to have one element per record, duplicating the corresponding regnum for each one. Any help is very much appreciated.

Answer


You can cast your DataFrame to a typed Dataset and then flatMap over it, zipping the two arrays so that each date/value pair becomes its own row:

    import sqlContext.implicits._  // needed for .as[...] and the tuple encoders

    df.select("reg.@attributes.regnum", "reg.data.date", "reg.data.value")
      .as[(Long, Array[String], Array[String])]  // each row: (regnum, dates, values)
      .flatMap { case (regnum, dates, values) =>
        // pair up dates with values, repeating regnum for every pair
        dates.zip(values).map { case (d, v) => (regnum, d, v) }
      }
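The zip-and-flatten step itself can be sketched in plain Scala collections (no Spark needed), with hypothetical values; each (regnum, dates, values) row turns into one output row per date/value pair:

```scala
// One input row: a regnum plus parallel arrays of dates and values.
val row = (1234L, Array("2017-01-01", "2017-01-02"), Array("10", "20"))

// flatMap pairs each date with its value and repeats the regnum,
// producing one flat (regnum, date, value) tuple per pair.
val flattened = Seq(row).flatMap { case (regnum, dates, values) =>
  dates.zip(values).map { case (d, v) => (regnum, d, v) }
}
// flattened == Seq((1234L, "2017-01-01", "10"), (1234L, "2017-01-02", "20"))
```

On the Dataset side the result has generated column names (`_1`, `_2`, `_3`); you can rename them with `toDF("regnum", "date", "value")` before writing to Cassandra.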

