How to pass a constant value to Python UDF?

Question

I was wondering whether it is possible to create a UDF that receives two arguments, a Column and another variable (an object, a dictionary, or any other type), then performs some operations and returns the result.

I actually attempted this, but I got an exception, so I would like to know whether there is a way to avoid the problem.

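from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
                                 ("Hayek", 60, 3000.00),
                                 ("Mises", 60, 1000.0)],
                                ["name", "age", "balance"])

comparatorUDF = udf(lambda c, n: c == n, BooleanType())

# This is the call that raises the exception.
df.where(comparatorUDF(col("name"), "Bonsanto")).show()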

I got the following error:

AnalysisException: u"cannot resolve 'Bonsanto' given input columns name, age, balance;"

So the UDF evidently "sees" the string "Bonsanto" as a column name, whereas I am trying to compare each record's value with the second argument.

On the other hand, I know it is possible to use some operators inside a where clause (but I want to know whether this is achievable with a UDF), as follows:

df.where(col("name") == "Bonsanto").show()

#+--------+---+-------+
#|    name|age|balance|
#+--------+---+-------+
#|Bonsanto| 20| 2000.0|
#+--------+---+-------+

Recommended answer

Everything that is passed to a UDF is interpreted as a column or a column name. If you want to pass a literal, you have two options:

  1. Pass the argument using currying:

def comparatorUDF(n):
    return udf(lambda c: c == n, BooleanType())

df.where(comparatorUDF("Bonsanto")(col("name")))

This works with an argument of any type, as long as it is serializable.
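To illustrate the "any serializable type" point, here is a minimal sketch that curries a dictionary instead of a string. The names make_lookup and lookup_udf, the default parameter, and the sample mapping in the usage comment are hypothetical and not part of the original answer. pyspark is imported inside the wrapper so that the plain closure can be tried without a Spark installation.

```python
def make_lookup(mapping, default=None):
    """Return a one-argument function that closes over the constant `mapping`."""
    def lookup(value):
        # `mapping` is captured by the closure and serialized together with it.
        return mapping.get(value, default)
    return lookup


def lookup_udf(mapping, default=None):
    """Wrap the closure in a Spark UDF that returns a string (hypothetical helper)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType
    return udf(make_lookup(mapping, default), StringType())


# Usage with the DataFrame from the question (needs a running Spark session):
# from pyspark.sql.functions import col
# groups = {"Bonsanto": "A", "Hayek": "B"}  # hypothetical labels
# df.withColumn("group", lookup_udf(groups, "unknown")(col("name"))).show()
```

Because the dictionary travels inside the closure, it never has to be expressed as a Spark column, which is why this route is not limited to atomic types.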

  2. Use a SQL literal with the original two-argument comparatorUDF from the question:

from pyspark.sql.functions import lit

df.where(comparatorUDF(col("name"), lit("Bonsanto")))

This works only with supported types (strings, numerics, booleans). For non-atomic types, see "How to add a constant column in a Spark DataFrame?"
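As a rough sketch of the non-atomic case, a Python list can be turned into a single constant array column by wrapping each element in lit and combining them with array, both from pyspark.sql.functions. The helper names is_member and member_filter are hypothetical, and the sketch assumes the list elements are themselves supported literal types.

```python
def is_member(value, allowed):
    """Plain Python predicate; an array column reaches a Python UDF as a list."""
    return value in allowed


def member_filter(column_name, allowed_values):
    """Build a boolean Column using a two-argument UDF (hypothetical helper)."""
    from pyspark.sql.functions import array, col, lit, udf
    from pyspark.sql.types import BooleanType
    member_udf = udf(is_member, BooleanType())
    # array(lit(...), ...) turns the Python list into one constant array column.
    constant = array(*[lit(v) for v in allowed_values])
    return member_udf(col(column_name), constant)


# Usage with the DataFrame from the question (needs a running Spark session):
# df.where(member_filter("name", ["Bonsanto", "Mises"])).show()
```

For a simple membership test like this one, the built-in Column.isin method would normally be preferable; the UDF is used here only to show how a constant list can be passed as a literal.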
