Pass array as an UDF parameter in Spark SQL

Problem description

I'm trying to transform a dataframe via a function that takes an array as a parameter. My code looks something like this:

import org.apache.spark.sql.functions.{col, lit, udf}

def getCategory(categories: Array[String], input: String): String = {
  categories(input.toInt)
}

val myArray = Array("a", "b", "c")

val myCategories = udf(getCategory _)

val df = sqlContext.parquetFile("myfile.parquet")

// This is where it fails: lit cannot build a literal column from an Array.
val df1 = df.withColumn("newCategory", myCategories(lit(myArray), col("myInput")))

However, lit doesn't like arrays and this script errors. I tried defining a new partially applied function and then the udf after that:

val newFunc = getCategory(myArray, _: String)
val myCategories = udf(newFunc)

val df1 = df.withColumn("newCategory", myCategories(col("myInput"))) 

This doesn't work either, as I get a NullPointerException; it appears myArray is not being recognized. Any ideas on how I can pass an array as a parameter to a function used on a dataframe?

On a separate note, any explanation as to why doing something as simple as using a function on a dataframe is so complicated (define the function, redefine it as a UDF, etc.)?

Recommended answer

Most likely not the prettiest solution, but you can try something like this:

def getCategory(categories: Array[String]) = {
  // The returned UDF closes over `categories`, so only the input
  // column has to be passed at call time.
  udf((input: String) => categories(input.toInt))
}

df.withColumn("newCategory", getCategory(myArray)(col("myInput")))

You could also try an array of literals:

val getCategory = udf(
  // Spark hands array columns to Scala UDFs as Seq (a WrappedArray),
  // so the parameter must be Seq[String], not Array[String].
  (input: String, categories: Seq[String]) => categories(input.toInt))

// The $-syntax requires: import sqlContext.implicits._
df.withColumn(
  "newCategory", getCategory($"myInput", array(myArray.map(lit(_)): _*)))

On a side note using Map instead of Array is probably a better idea:

def mapCategory(categories: Map[String, String], default: String) = {
  udf((input: String) => categories.getOrElse(input, default))
}

val myMap = Map[String, String]("1" -> "a", "2" -> "b", "3" -> "c")

df.withColumn("newCategory", mapCategory(myMap, "foo")(col("myInput")))

Since Spark 1.5.0 you can also use the array function:

import org.apache.spark.sql.functions.array

// Build an ArrayType column from the local string literals.
// (lit on a Column is a no-op, so colArray needs no extra wrapping.)
val colArray = array(myArray.map(lit(_)): _*)
myCategories(colArray, col("myInput"))
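One caveat worth flagging (my addition, not part of the original answer): Spark delivers an array column to a Scala UDF as a Seq (concretely a WrappedArray), so a UDF parameter declared as Array[String], like the question's original getCategory, fails at runtime with a ClassCastException. A sketch of a definition that matches the array column:

import org.apache.spark.sql.functions.{array, col, lit, udf}

// Declare the parameter as Seq[String]: array columns arrive as
// WrappedArray, which is a Seq but cannot be cast to Array[String].
val myCategories = udf(
  (categories: Seq[String], input: String) => categories(input.toInt))

val colArray = array(myArray.map(lit(_)): _*)
df.withColumn("newCategory", myCategories(colArray, col("myInput")))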

See also: Spark UDF with varargs
