How to pass a constant value to Python UDF?

Question

I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (an object, a dictionary, or any other type), performs some operations, and returns the result.

I attempted this but got an exception, so I would like to know whether there is a way to avoid the problem.

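from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Assumes an existing sqlContext, as in the original question
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00), 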
                                 ("Hayek", 60, 3000.00), 
                                 ("Mises", 60, 1000.0)], 
                                ["name", "age", "balance"])

comparatorUDF = udf(lambda c, n: c == n, BooleanType())

df.where(comparatorUDF(col("name"), "Bonsanto")).show()

This raises the following error:

AnalysisException: u"cannot resolve 'Bonsanto' given input columns name, age, balance;"

So it's obvious that the UDF treats the string "Bonsanto" as a column name, whereas I'm actually trying to compare each record's value with the second argument.

On the other hand, I know that operators can be used inside a where clause, as follows (but I want to know whether the same is achievable with a UDF):

df.where(col("name") == "Bonsanto").show()

#+--------+---+-------+
#|    name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+

Recommended answer

Everything passed to a UDF is interpreted as a column or a column name. If you want to pass a literal, you have two options:

  1. Pass the argument using currying:

def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())

df.where(comparatorUDF("Bonsanto")(col("name")))

This works with an argument of any type, as long as it is serializable.

  2. Use a SQL literal with the original two-argument UDF from the question:

from pyspark.sql.functions import lit

df.where(comparatorUDF(col("name"), lit("Bonsanto")))

This works only with supported types (strings, numerics, booleans). For non-atomic types, see "How to add a constant column in a Spark DataFrame?"
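As a rough sketch of why the currying approach in option 1 also works for non-atomic constants such as a dictionary, the closure mechanism can be tried in plain Python, without a Spark session. The names `make_lookup`, `schools`, and `school_of` are hypothetical and exist only for this illustration; the Spark wrapping in the trailing comment assumes the usual `udf` and `StringType` imports.

```python
def make_lookup(mapping, default=None):
    """Return a one-argument function with `mapping` captured in its closure."""
    def lookup(value):
        # `mapping` is not a parameter here; it travels with the function
        return mapping.get(value, default)
    return lookup


# Hypothetical constant: a dict cannot be passed to a UDF as a column argument
schools = {"Hayek": "Austrian", "Mises": "Austrian"}
school_of = make_lookup(schools, default="unknown")

# Stand-in for the values of the "name" column
names = ["Bonsanto", "Hayek", "Mises"]
result = [school_of(name) for name in names]
print(result)

# With Spark, the same closure would be wrapped along these lines:
#   school_udf = udf(make_lookup(schools, "unknown"), StringType())
#   df.withColumn("school", school_udf(col("name")))
```

Because the returned function only takes the column value, Spark never has to resolve the dictionary as a column; it is serialized together with the function and shipped to the executors.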
