withColumn not allowing me to use max() function to generate a new column
Question
I have a dataset like this:
a = sc.parallelize([[1,2,3],[0,2,1],[9,8,7]]).toDF(["one", "two", "three"])
I want to have a dataset that adds a new column that is equal to the largest value in the other three columns. The output would look like this:
+----+----+-----+-------+
|one |two |three|max_col|
+----+----+-----+-------+
| 1| 2| 3| 3|
| 0| 2| 1| 2|
| 9| 8| 7| 9|
+----+----+-----+-------+
I thought I would use withColumn, like so:
b = a.withColumn("max_col", max(a["one"], a["two"], a["three"]))
But this generates an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark152/python/pyspark/sql/column.py", line 418, in __nonzero__
raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Odd. Does max return a bool? Not according to the documentation on max. Okay. Weird.
I find it odd that this works:
b = a.withColumn("max_col", a["one"] + a["two"] + a["three"])
And the fact that it works makes me think even more strongly that max is behaving in some way I don't understand.
I also tried b = a.withColumn("max_col", max([a["one"], a["two"], a["three"]])), which passes in the three columns as a list rather than as three separate elements. This yields the same error as above.
Answer
Actually, what you need here is greatest, not max:
from pyspark.sql.functions import greatest
a.withColumn("max_col", greatest(a["one"], a["two"], a["three"]))
And just for completeness, you can use least to find the minimum:
from pyspark.sql.functions import least
a.withColumn("min_col", least(a["one"], a["two"], a["three"]))
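As a quick sanity check of the row-wise semantics, the values that greatest and least should produce for the sample data can be computed with plain Python — builtin max/min work fine on ordinary numbers, just not on Column objects:

```python
# Row-wise simulation of what greatest/least compute for the sample data.
# Plain Python only; this is a sanity check, not PySpark code.
rows = [[1, 2, 3], [0, 2, 1], [9, 8, 7]]

max_col = [max(r) for r in rows]  # builtin max is fine on plain ints
min_col = [min(r) for r in rows]

print(max_col)  # [3, 2, 9]
print(min_col)  # [1, 0, 7]
```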
Regarding the error you see, it is quite simple: max depends on rich comparisons. When you compare two columns you get a Column:
from pyspark.sql.functions import col

type(col("a") < col("b"))
## pyspark.sql.column.Column
PySpark explicitly forbids converting columns to booleans (you can check the Column.__nonzero__ source) because it is simply meaningless: a Column is only a logical expression, which cannot be evaluated in the driver context.