How to pass a constant value to a Python UDF?
Question
I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (an Object, a Dictionary, or any other type), then performs some operations and returns the result.
Actually, I attempted to do this but I got an exception. Therefore, I was wondering if there was any way to avoid this problem.
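from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])
# UDF that compares a column value with a second argument
comparatorUDF = udf(lambda c, n: c == n, BooleanType())
# Pass a plain Python string as the second argument
df.where(comparatorUDF(col("name"), "Bonsanto")).show()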
And I get the following error:
AnalysisException: u"cannot resolve 'Bonsanto' given input columns name, age, balance;"
So it's obvious that the UDF "sees" the string "Bonsanto" as a column name, whereas I'm actually trying to compare a record value with the second argument.
On the other hand, I know that it's possible to use some operators inside a where clause (but I want to know whether this is achievable using a UDF), as follows:
df.where(col("name") == "Bonsanto").show()
#+--------+---+-------+
#|    name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+
Recommended answer
Everything that is passed to a UDF is interpreted as a column or a column name. If you want to pass a literal, you have two options:
Pass the argument using currying:
def comparatorUDF(n):
    # n is captured by the closure, so only the column is passed at call time
    return udf(lambda c: c == n, BooleanType())

df.where(comparatorUDF("Bonsanto")(col("name")))
This can be used with an argument of any type as long as it is serializable.
Use a SQL literal with the current implementation:
from pyspark.sql.functions import lit
# comparatorUDF here is the original two-argument UDF from the question
df.where(comparatorUDF(col("name"), lit("Bonsanto")))
This works only with supported types (strings, numerics, booleans). For non-atomic types, see "How to add a constant column in a Spark DataFrame?"
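Since the question also mentions passing a Dictionary, here is a minimal sketch of the currying option with a dictionary argument. The names make_lookup and lookup_udf and the tiers mapping are hypothetical and not part of the original answer; the sketch assumes the dictionary can be serialized along with the function.

```python
def make_lookup(mapping, default=None):
    """Return a plain Python function that closes over `mapping`."""
    def lookup(value):
        return mapping.get(value, default)
    return lookup


def lookup_udf(mapping, default=None):
    """Wrap the closure as a UDF; `mapping` is serialized together with the function."""
    # Imported here so make_lookup can be tried without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(make_lookup(mapping, default), StringType())


# Hypothetical mapping, used only for illustration.
tiers = {"Bonsanto": "gold", "Hayek": "silver"}

# The closure itself can be checked without Spark:
tier_of = make_lookup(tiers, default="none")
print(tier_of("Hayek"))  # silver
print(tier_of("Mises"))  # none

# With the DataFrame from the question (needs an active Spark session and col imported):
# df.withColumn("tier", lookup_udf(tiers, "none")(col("name"))).show()
```

The dictionary never appears as a UDF call argument, so Spark does not try to resolve it as a column; only col("name") is passed when the UDF is applied.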