Scala & Spark: Cast multiple columns at once
Question
Since the VectorAssembler crashes if a passed column has any type other than NumericType or BooleanType, and I'm dealing with a lot of TimestampType columns, I want to know:
Is there an easy way to cast multiple columns at once?
Based on this answer I already have a convenient way to cast a single column:
def castColumnTo(df: DataFrame,
                 columnName: String,
                 targetType: DataType): DataFrame = {
  df.withColumn(columnName, df(columnName).cast(targetType))
}
I thought about calling castColumnTo recursively, but I strongly doubt that's the (performant) way to go.
Recommended answer
Based on the comments (thanks!) I came up with the following code (no error handling implemented):
def castAllTypedColumnsTo(df: DataFrame,
                          sourceType: DataType,
                          targetType: DataType): DataFrame = {
  // Collect every column whose current type matches sourceType
  val columnsToBeCasted = df.schema
    .filter(s => s.dataType == sourceType)
  // Fold over those columns, casting each one to targetType
  // (the original snippet mistakenly hard-coded LongType here)
  columnsToBeCasted.foldLeft(df) { (foldedDf, col) =>
    castColumnTo(foldedDf, col.name, targetType)
  }
}
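The fold-over-columns pattern above can be illustrated without Spark at all. In the sketch below, a plain Map stands in for the DataFrame's schema; all names (Frame, castColumn, castAllTyped) are hypothetical, not Spark API:

```scala
// Hypothetical stand-in for a DataFrame's schema: column name -> type name.
type Frame = Map[String, String]

// Analog of castColumnTo: "cast" one column by rewriting its type entry.
def castColumn(df: Frame, name: String, target: String): Frame =
  df.updated(name, target)

// Analog of castAllTypedColumnsTo: find the matching columns, then fold,
// threading the updated frame through each single-column cast.
def castAllTyped(df: Frame, source: String, target: String): Frame = {
  val columnsToBeCasted = df.collect { case (n, t) if t == source => n }
  columnsToBeCasted.foldLeft(df)((folded, n) => castColumn(folded, n, target))
}

val frame = Map("id" -> "long", "created" -> "timestamp", "updated" -> "timestamp")
val casted = castAllTyped(frame, "timestamp", "long")
// casted == Map("id" -> "long", "created" -> "long", "updated" -> "long")
```

The key point carried over from the Spark version: each step of the fold receives the frame produced by the previous step, so all casts accumulate into a single result.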
Thanks for the inspiring comments. foldLeft (explained here and here) saves a for loop iterating over a var dataframe.
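As a generic, Spark-free illustration of that point: the loop below reassigns a var on every pass, while foldLeft threads the value through immutably (the string transforms are just placeholders):

```scala
val steps = List[String => String](_.toUpperCase, _ + "!")

// Imperative style: mutate a var on every pass of the loop...
var mutated = "spark"
for (f <- steps) mutated = f(mutated)

// ...foldLeft expresses the same accumulation without the var.
val folded = steps.foldLeft("spark")((acc, f) => f(acc))
// both are "SPARK!"
```

Swapping the var/for pair for foldLeft changes nothing about the result; it only removes the mutable state, which is why it fits the DataFrame-casting fold so naturally.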