Scala & Spark: Cast multiple columns at once


Problem description

Since VectorAssembler crashes if a passed column has any type other than NumericType or BooleanType, and I'm dealing with a lot of TimestampType columns, I want to know:

Is there an easy way to cast multiple columns at once?

Based on this answer, I already have a convenient way to cast a single column:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType

def castColumnTo(df: DataFrame,
    columnName: String,
    targetType: DataType): DataFrame = {
  df.withColumn(columnName, df(columnName).cast(targetType))
}

I thought about calling castColumnTo recursively, but I strongly doubt that's the (performant) way to go.

Recommended answer

Based on the comments (thanks!) I came up with the following code (no error handling implemented):

def castAllTypedColumnsTo(df: DataFrame,
    sourceType: DataType, targetType: DataType): DataFrame = {

  // Collect every column whose current type matches sourceType.
  val columnsToBeCasted = df.schema
     .filter(s => s.dataType == sourceType)

  //if (columnsToBeCasted.length > 0) {
  //   println(s"Found ${columnsToBeCasted.length} columns " +
  //      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
  //      s" - casting to ${targetType.typeName.capitalize}Type")
  //}

  // Fold over the matching columns, casting each one to targetType.
  columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
    castColumnTo(foldedDf, col.name, targetType)
  }
}
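To illustrate the filter-then-fold shape of the answer without needing a Spark session, here is a minimal pure-Scala sketch. `Frame` and `withColumnType` are hypothetical stand-ins for `DataFrame` and the cast, not Spark API:

```scala
// Hypothetical stand-in for a DataFrame: just a column-name -> type-name map.
case class Frame(schema: Map[String, String]) {
  // Stand-in for withColumn(...cast...): re-type one column, return a new Frame.
  def withColumnType(name: String, t: String): Frame =
    Frame(schema.updated(name, t))
}

val df = Frame(Map("a" -> "timestamp", "b" -> "string", "c" -> "timestamp"))

// Same shape as castAllTypedColumnsTo: filter matching columns, then fold.
val toCast = df.schema.filter { case (_, t) => t == "timestamp" }.keys
val casted = toCast.foldLeft(df)((acc, name) => acc.withColumnType(name, "long"))
// casted.schema: Map(a -> long, b -> string, c -> long)
```

Each step of the fold receives the frame produced by the previous step, which is exactly how the answer threads the intermediate DataFrames through `castColumnTo`.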

Thanks for the inspiring comments. foldLeft (explained here and here) saves a for loop iterating over a var DataFrame.
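As a minimal illustration of that point, foldLeft threads an accumulator through the collection, so the mutable var and the loop disappear:

```scala
val xs = List(1, 2, 3, 4)

// Imperative style: mutate a var inside a for loop.
var total = 0
for (x <- xs) total += x

// foldLeft: same result, no mutation; the accumulator is threaded through.
val folded = xs.foldLeft(0)((acc, x) => acc + x)
// total == 10 and folded == 10
```

In the answer above, the accumulator is the DataFrame itself rather than an Int, but the mechanism is identical.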
