Where do you need to use lit() in Pyspark SQL?
Question
I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation.
Take for example this udf, which returns the element at a given index of a SQL array column:
def find_index(column, index):
    return column[index]
If I were to pass an integer into this I would get an error. I would need to pass a lit(n) value into the udf to get the correct element of the array.
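To see why a bare integer fails, here is a toy, pure-Python sketch (not real pyspark; the Column class, lit, and call_udf below are stand-ins invented for illustration) of the rule that a UDF call only accepts Column-typed arguments:

```python
# Toy sketch mimicking why a UDF call rejects a bare Python int
# but accepts a Column produced by lit(). Not real pyspark.

class Column:
    """Stand-in for pyspark.sql.Column, holding an expression string."""
    def __init__(self, expr):
        self.expr = expr

def lit(value):
    # Wrap a Python scalar as a literal Column, as pyspark.sql.functions.lit does.
    return Column(repr(value))

def call_udf(name, *args):
    # The JVM side expects every argument to be a column expression;
    # a raw scalar has no expression to send, hence the error.
    for a in args:
        if not isinstance(a, Column):
            raise TypeError(f"{name}: expected Column, got {type(a).__name__}")
    return Column(f"{name}({', '.join(a.expr for a in args)})")

col = Column("my_array")
ok = call_udf("find_index", col, lit(1))   # builds an expression
# call_udf("find_index", col, 1)           # would raise TypeError
```

In real pyspark the failure mode is analogous: a plain scalar argument has no JVM column expression behind it, so wrapping it with lit() is what makes the call valid.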
Is there a place where I can learn the hard-and-fast rules of when to use lit, and possibly col as well?
Answer
To keep it simple, you need a Column (it can be one created using lit, but that is not the only option) when the JVM counterpart expects a column and there is no internal conversion in the Python wrapper, or when you want to call a Column-specific method.
In the first case the only strict rule is the one that applies to UDFs: a UDF (Python or JVM) can be called only with arguments of Column type. This also typically applies to functions from pyspark.sql.functions. In other cases it is best to check the documentation and docstrings first, and if those are not sufficient, the docs of the corresponding Scala counterpart.
In the second case the rules are simple. If, for example, you want to compare a column to a value, then the value has to be on the RHS:
col("foo") > 0 # OK
or the value has to be wrapped with a literal:
lit(0) < col("foo") # OK
In Python many operators (<, ==, <=, &, |, +, -, *, /) can use a non-column object on the LHS:
0 < col("foo")
but such applications are not supported in Scala.
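The reason the non-column LHS works in Python is the language's reflected-operator protocol: when int.__lt__ sees an unknown type it returns NotImplemented, and Python falls back to the reflected method on the other operand. A toy sketch (the Column class below is an invented stand-in, not pyspark's):

```python
# Toy sketch of why `0 < col("foo")` works in Python: int.__lt__ doesn't
# know the Column type, so Python tries the reflected Column.__gt__.

class Column:
    """Stand-in for pyspark.sql.Column, building expression strings."""
    def __init__(self, expr):
        self.expr = expr
    def __gt__(self, other):
        return Column(f"({self.expr} > {other})")
    def __lt__(self, other):
        return Column(f"({self.expr} < {other})")

c = Column("foo")
left = (c > 0).expr    # dispatches to c.__gt__(0)
right = (0 < c).expr   # int.__lt__ fails, Python retries with c.__gt__(0)
```

Both expressions come out identical, because Python silently flips the comparison. Scala has no such reflected-operator fallback for arbitrary types, which is why the value-on-LHS form is Python-only.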
It goes without saying that you have to use lit if you want to access any of the pyspark.sql.Column methods treating a standard Python scalar as a constant column. For example, you'll need
c = lit(1)
not
c = 1
to call
c.between(0, 3) # type: pyspark.sql.Column
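The point is that between() is a method defined on Column, not on int; a toy sketch (again using invented stand-ins for Column and lit, not the real pyspark classes) of why c = 1 fails while c = lit(1) works:

```python
# Toy sketch: between() lives on Column, not on int, so a plain
# scalar must be wrapped with lit() before the method exists at all.

class Column:
    """Stand-in for pyspark.sql.Column."""
    def __init__(self, expr):
        self.expr = expr
    def between(self, lower, upper):
        return Column(f"({self.expr} >= {lower}) AND ({self.expr} <= {upper})")

def lit(value):
    return Column(repr(value))

c = lit(1)
expr = c.between(0, 3).expr
# (1).between(0, 3)  # AttributeError: 'int' object has no attribute 'between'
```

Wrapping the scalar gives you an object that carries the Column API, which is exactly what the real lit() buys you in pyspark.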