RDD of BSONObject to a DataFrame
Question
I'm loading a bson dump from Mongo into Spark as described here. It works, but what I get is:
org.apache.spark.rdd.RDD[(Object, org.bson.BSONObject)]
It should basically be just JSON with all String fields. The rest of my code requires a DataFrame object to manipulate the data. But, of course, toDF fails on that RDD. How can I convert it to a Spark DataFrame with all fields as String? Something similar to spark.read.json would be great to have.
Recommended answer
Try the following code:
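import com.mongodb.hadoop.BSONFileInputFormat
import org.bson.BSONObject
// Re-serialize each document as JSON so that spark.read.json can parse it.
def parseData(s: String): String = {
  val doc = org.bson.Document.parse(s)
  com.mongodb.util.JSON.serialize(doc)
}
// Read the BSON dump as (key, BSONObject) pairs; only the values are needed.
val rdd = spark.sparkContext.newAPIHadoopFile("src//main//resources//MyDummyData", classOf[BSONFileInputFormat].asSubclass(classOf[org.apache.hadoop.mapreduce.lib.input.FileInputFormat[Object, BSONObject]]), classOf[Object], classOf[BSONObject])
val df = spark.read.json(rdd.map(x => parseData(x._2.toString)))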
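Note that spark.read.json infers column types, so numeric or nested values will not necessarily come out as String. If every column really has to be a String, one possible alternative is to build the rows by hand with an explicit schema. The following is only a sketch under assumptions that are not part of the original answer: the object BsonToDataFrame and the helper toStringDataFrame are hypothetical names, and it assumes the top-level field names are known in advance and passed in as `fields`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.bson.BSONObject

object BsonToDataFrame {
  // Hypothetical helper (not from the original answer): builds a DataFrame
  // in which every column is a nullable String.
  // Assumption: `fields` lists the top-level keys to extract from each document.
  def toStringDataFrame(
      spark: SparkSession,
      rdd: RDD[(Object, BSONObject)],
      fields: Seq[String]): DataFrame = {
    // One nullable StringType column per requested field.
    val schema = StructType(
      fields.map(name => StructField(name, StringType, nullable = true)))

    val rows = rdd.map { case (_, doc) =>
      // BSONObject.get returns null for a missing key; keep that as a null cell,
      // otherwise fall back to the value's toString representation.
      Row.fromSeq(fields.map(f => Option(doc.get(f)).map(_.toString).orNull))
    }

    spark.createDataFrame(rows, schema)
  }
}
```

With the rdd from the answer above, this could be called as BsonToDataFrame.toStringDataFrame(spark, rdd, Seq("name", "age")), where the field names are placeholders. Nested documents end up as whatever their toString produces, so this approach fits flat documents best.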