How to use Scala UDF in PySpark?
Question
I want to be able to use a Scala function as a UDF in PySpark:
package com.test

import org.apache.spark.sql.functions.udf

object ScalaPySparkUDFs extends Serializable {
    def testFunction1(x: Int): Int = { x * 2 }
    def testUDFFunction1 = udf { x: Int => testFunction1(x) }
}
I can access testFunction1 in PySpark and have it return values:
functions = sc._jvm.com.test.ScalaPySparkUDFs
functions.testFunction1(10)
What I want to be able to do is use this function as a UDF, ideally in a withColumn call:
from pyspark.sql import Row

row = Row("Value")
numbers = sc.parallelize([1, 2, 3, 4]).map(row).toDF()
numbers.withColumn("Result", testUDFFunction1(numbers['Value']))
I think a promising approach is the one found here: Spark: How to map Python with Scala or Java User Defined Functions?
However, when I change the code found there to use testUDFFunction1 instead:
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column, _to_seq

def udf_test(col):
    sc = SparkContext._active_spark_context
    _f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1.apply
    return Column(_f(_to_seq(sc, [col], _to_java_column)))
I get:
AttributeError: 'JavaMember' object has no attribute 'apply'
I don't understand this, because I believe testUDFFunction1 does have an apply method?
I do not want to use expressions of the type found here: Register UDF to SqlContext from Scala to use in PySpark
Any suggestions as to how to make this work would be appreciated!
Answer
The question you've linked uses a Scala object. A Scala object is a singleton, so you can use its apply method directly.
Here you use a nullary function, which returns an object of the UserDefinedFunction class, so you have to call the function first:
_f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1() # Note () at the end
Column(_f.apply(_to_seq(sc, [col], _to_java_column)))
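The distinction can be illustrated without Spark at all. The sketch below is a plain-Python analogy with hypothetical names (no py4j or Spark involved): it contrasts a Scala singleton object that exposes apply directly with a nullary factory method, like testUDFFunction1, that must be called first to obtain an object with apply.

```python
# Plain-Python analogy (hypothetical names, no Spark involved) of the
# singleton-vs-nullary-function distinction described above.

class UserDefinedFunctionLike:
    """Stands in for Spark's UserDefinedFunction: it exposes apply()."""
    def apply(self, x):
        return x * 2

# Analogue of a Scala `object` with an apply method: the instance
# already exists, so .apply can be used on it directly.
singleton = UserDefinedFunctionLike()

# Analogue of `def testUDFFunction1 = udf { ... }`: a nullary function
# that builds a fresh UserDefinedFunction-like object on each call.
def test_udf_function1():
    return UserDefinedFunctionLike()

print(singleton.apply(10))              # works: apply on the existing object
print(test_udf_function1().apply(10))   # works: call the factory, then apply

# Accessing .apply on the factory itself fails, which mirrors the
# AttributeError py4j raises on the JavaMember in the question.
print(hasattr(test_udf_function1, "apply"))
```

In the PySpark fix above, testUDFFunction1() with parentheses plays the role of the factory call, and .apply is then invoked on the Java object it returns via py4j.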