RDD of BSONObject to a DataFrame

Question

I'm loading a bson dump from Mongo into Spark as described here. It works, but what I get is:

org.apache.spark.rdd.RDD[(Object, org.bson.BSONObject)]

It should basically be just JSON with all String fields. The rest of my code requires a DataFrame object to manipulate the data. But, of course, toDF fails on that RDD. How can I convert it to a Spark DataFrame with all fields as String? Something similar to spark.read.json would be great to have.

Recommended answer

Try the following code:

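import com.mongodb.hadoop.BSONFileInputFormat
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.bson.BSONObject

// Re-parse each record and serialize it as a JSON string.
// com.mongodb.util.JSON is deprecated in newer drivers; doc.toJson() is an alternative.
def parseData(s: String): String = {
  val doc = org.bson.Document.parse(s)
  com.mongodb.util.JSON.serialize(doc)
}

// Read the BSON dump as (key, BSONObject) pairs, keep the values, and let Spark infer the schema.
val df = spark.read.json(
  spark.sparkContext.newAPIHadoopFile(
    "src//main//resources//MyDummyData",
    classOf[BSONFileInputFormat].asSubclass(classOf[FileInputFormat[Object, BSONObject]]),
    classOf[Object],
    classOf[BSONObject]
  ).map(x => x._2).map(x => parseData(x.toString)))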
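The question asks for every field to be a String, whereas spark.read.json infers numeric and boolean types by default. One possible way to get closer to that is the JSON reader's primitivesAsString option. The sketch below is not part of the original answer. It assumes Spark 2.2 or later and uses a small in-memory Dataset of JSON strings as a stand-in for the output of parseData. The object name AllStringColumns, the helper readAllStrings, and the sample records are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object AllStringColumns {
  // Hypothetical helper: read JSON strings while inferring every primitive value as a string.
  // Nested documents and arrays are still inferred as structs and arrays.
  def readAllStrings(spark: SparkSession, json: Dataset[String]): DataFrame =
    spark.read.option("primitivesAsString", "true").json(json)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bson-to-string-df") // hypothetical application name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the JSON strings produced by parseData
    val jsonLines = Seq(
      """{"name": "a", "count": 1, "active": true}""",
      """{"name": "b", "count": 2, "active": false}"""
    ).toDS()

    val df = readAllStrings(spark, jsonLines)
    df.printSchema() // inspect the inferred column types

    spark.stop()
  }
}
```

With the RDD from the answer, the same option can be set before the call to json. If nested fields also need to be flat strings, they would have to be selected or cast explicitly afterwards.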
