Spark Scala 2.10 tuple limit


Question

I have a DataFrame with 66 columns to process (almost every column's value needs to be changed in some way), so I'm running the following statement:

    val result = data.map(row => (
        modify(row.getString(row.fieldIndex("XX"))),
        (...)
        )
    )

up to the 66th column. Since Scala in this version limits tuples to a maximum of 22 elements, I cannot do it this way. The question is: is there any workaround? After all the per-column operations I convert the result to a DataFrame with specific column names:

    val df = result.toDF("c1", ..., "c66")
    df.registerTempTable("someFancyResult")

The "modify" function is just an example to show my point.

Answer

If all you do is modify values in an existing DataFrame, it is better to use a UDF instead of mapping over an RDD:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // for the $"colName" column syntax

val modifyUdf = udf(modify _)
data.withColumn("c1", modifyUdf($"c1"))
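
With 66 columns to touch, one way to avoid spelling out 66 withColumn calls is to fold the UDF over the column names (a sketch, assuming every column is a string that goes through the same "modify"):

// Apply the UDF to every column by folding over the column names.
// Assumes all 66 columns are strings handled by the same `modify`;
// withColumn replaces an existing column of the same name (Spark 1.4+).
val transformed = data.columns.foldLeft(data) { (df, name) =>
  df.withColumn(name, modifyUdf(df(name)))
}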

If for some reason the above doesn't fit your needs, the simplest thing you can do is to recreate the DataFrame from an RDD[Row], for example like this:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val result: RDD[Row] = data.map(row => {
  val buffer = ArrayBuffer.empty[Any]

  // Add value to buffer
  buffer.append(modify(row.getAs[String]("c1")))

  // ... repeat for other values

  // Build row
  Row.fromSeq(buffer)
})

// Create schema
val schema = StructType(Seq(
  StructField("c1", StringType, false),
  // ...
  StructField("c66", StringType, false)
))

sqlContext.createDataFrame(result, schema)
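
The rebuilt DataFrame can then be registered under the temporary table name from the question (a sketch; registerTempTable is the Spark 1.x API, replaced by createOrReplaceTempView in Spark 2.x):

// Register the result so it can be queried with SQL, reusing the
// table name from the question.
val df = sqlContext.createDataFrame(result, schema)
df.registerTempTable("someFancyResult")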
