Why does the Spark/Scala compiler fail to find toDF on RDD[Map[Int, Int]]?
Question
Why does the following end up with an error?
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> val rdd = sc.parallelize(1 to 10).map(x => (Map(x -> 0), 0))
rdd: org.apache.spark.rdd.RDD[(scala.collection.immutable.Map[Int,Int], Int)] = MapPartitionsRDD[20] at map at <console>:27
scala> rdd.toDF
res8: org.apache.spark.sql.DataFrame = [_1: map<int,int>, _2: int]
scala> val rdd = sc.parallelize(1 to 10).map(x => Map(x -> 0))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]] = MapPartitionsRDD[23] at map at <console>:27
scala> rdd.toDF
<console>:30: error: value toDF is not a member of org.apache.spark.rdd.RDD[scala.collection.immutable.Map[Int,Int]]
rdd.toDF
So what exactly is happening here? toDF can convert an RDD of type (scala.collection.immutable.Map[Int,Int], Int) to a DataFrame, but not one of type scala.collection.immutable.Map[Int,Int]. Why is that?
Answer
For the same reason why you cannot use
sqlContext.createDataFrame((1 to 10).map(x => Map(x -> 0)))
If you take a look at the org.apache.spark.sql.SQLContext source, you'll find two different implementations of the createDataFrame method:
def createDataFrame[A <: Product : TypeTag](rdd: RDD[A]): DataFrame
and
def createDataFrame[A <: Product : TypeTag](data: Seq[A]): DataFrame
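The effect of the `A <: Product` context bound can be sketched in plain Scala, independent of Spark. The helper name `describeProduct` below is illustrative, not part of the Spark API; it only demonstrates why tuples satisfy the bound while Map does not:

```scala
// A generic method with the same kind of bound as createDataFrame:
// A must be a subtype of Product. Tuples and case classes qualify.
def describeProduct[A <: Product](a: A): Int = a.productArity

object BoundDemo extends App {
  // Tuple2 extends Product, so this compiles; a pair has arity 2.
  println(describeProduct((Map(1 -> 0), 0))) // prints 2

  // Map[Int, Int] does not extend Product, so the next line
  // would not compile, mirroring the missing-toDF error:
  // describeProduct(Map(1 -> 0))
}
```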
As you can see, both require A to be a subclass of Product. When you call toDF on an RDD[(Map[Int,Int], Int)] it works because Tuple2 is indeed a Product. Map[Int,Int] by itself is not, hence the error.
You can make it work by wrapping the Map in a Tuple1:
sc.parallelize(1 to 10).map(x => Tuple1(Map(x -> 0))).toDF
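An equivalent approach, sketched here as an addition to the original answer, is to wrap the Map in a case class: the compiler makes every case class extend Product, so an RDD of them satisfies the same bound. The `MapRecord` name below is a hypothetical wrapper chosen for illustration:

```scala
// Hypothetical wrapper type; any case class works, since case
// classes automatically extend Product (here with arity 1).
case class MapRecord(value: Map[Int, Int])

object WrapDemo extends App {
  val record = MapRecord(Map(1 -> 0))
  // Both the case class and Tuple1 satisfy the Product bound:
  assert(record.isInstanceOf[Product])
  assert(Tuple1(Map(1 -> 0)).isInstanceOf[Product])
  // With Spark this would enable:
  //   sc.parallelize(1 to 10).map(x => MapRecord(Map(x -> 0))).toDF
  // yielding a column named "value" rather than the tuple's "_1".
}
```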