How to convert an RDD[Row] back to DataFrame


Problem Description

I've been playing around with converting RDDs to DataFrames and back again. First, I had an RDD of type (Int, Int) called dataPair. Then I created a DataFrame object with column headers using:

val dataFrame = dataPair.toDF(header(0), header(1))

Then I converted it from a DataFrame back to an RDD using:

val testRDD = dataFrame.rdd

which returns an RDD of type org.apache.spark.sql.Row (not (Int, Int)). Then I'd like to convert it back to a DataFrame using .toDF, but I get an error:

error: value toDF is not a member of org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

I've tried defining a schema of type Data(Int, Int) for testRDD, but I get a type mismatch exception:

error: type mismatch;
found   : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[Data]
    val testRDD: RDD[Data] = dataFrame.rdd
                                       ^

I've already imported

import sqlContext.implicits._

Solution

To create a DataFrame from an RDD of Rows, you usually have two main options:

1) You can use toDF(), which is made available by import sqlContext.implicits._. However, this approach only works for the following types of RDDs:

  • RDD[Int]
  • RDD[Long]
  • RDD[String]
  • RDD[T <: scala.Product]

(source: Scaladoc of the SQLContext.implicits object: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext$implicits$)

The last signature actually means that it can work for an RDD of tuples or an RDD of case classes (because tuples and case classes are subclasses of scala.Product).
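
For example, toDF can be called directly on an RDD of tuples such as the dataPair RDD from the question. A minimal sketch (sc is assumed to be an existing SparkContext, and the values are illustrative):

import sqlContext.implicits._

// An RDD[(Int, Int)]; Tuple2 is a subtype of scala.Product, so toDF applies.
val dataPair = sc.parallelize(Seq((1, 10), (2, 20)))
val dataFrame = dataPair.toDF("key", "value")  // DataFrame with columns "key" and "value"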

So, to use this approach for an RDD[Row], you have to map it to an RDD[T <: scala.Product]. This can be done by mapping each row to a custom case class or to a tuple, as in the following code snippets:

import org.apache.spark.sql.Row  // needed for the Row extractor in the pattern match

val df = rdd.map({
  case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")

or

case class MyClass(val1: String, ..., valN: Long = 0L)
val df = rdd.map({ 
  case Row(val1: String, ..., valN: Long) => MyClass(val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
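
For concreteness, here is the tuple variant written out for a hypothetical two-column DataFrame holding a String and a Long (the column names and types are illustrative, not taken from the question):

import org.apache.spark.sql.Row
import sqlContext.implicits._

// rdd is assumed to be an RDD[Row] with exactly two columns: a String and a Long.
// A row that does not match the pattern throws a scala.MatchError at runtime.
val df = rdd.map({
  case Row(name: String, count: Long) => (name, count)
}).toDF("name", "count")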

The main drawback of this approach (in my opinion) is that you have to explicitly set the schema of the resulting DataFrame in the map function, column by column. Maybe this can be done programmatically if you don't know the schema in advance, but things can get a little messy there. So, alternatively, there is another option:


2) You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example:

val df = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)

Note that there is no need to explicitly set any schema column: we reuse the old DF's schema, which is of the StructType class and can be easily extended. However, this approach is sometimes not possible, and in some cases it can be less efficient than the first one.
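
If there is no existing DataFrame whose schema you can borrow, the StructType can also be built by hand. A minimal sketch, with hypothetical column names and types:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Build a schema for two columns; the names and types here are only an example.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("count", LongType, nullable = false)
))

val df = sqlContext.createDataFrame(rowRDD, schema)  // rowRDD: RDD[Row] matching the schema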

I hope it's clearer than before. Cheers.
