withColumn not allowing me to use max() function to generate a new column


Problem description

I have a dataset like this:

a = sc.parallelize([[1,2,3],[0,2,1],[9,8,7]]).toDF(["one", "two", "three"])

I want to have a dataset that adds a new column that is equal to the largest value in the other three columns. The output would look like this:

+----+----+-----+-------+
|one |two |three|max_col|
+----+----+-----+-------+
|   1|   2|    3|      3|
|   0|   2|    1|      2|
|   9|   8|    7|      9|
+----+----+-----+-------+

I thought I would use withColumn, like so:

b = a.withColumn("max_col", max(a["one"], a["two"], a["three"]))

But this yields the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark152/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Odd. Does max return a bool? Not according to the documentation on max. Okay. Weird.

I find it strange that this works:

b = a.withColumn("max_col", a["one"] + a["two"] + a["three"])

And the fact that it works makes me think even more strongly that max is behaving some way I don't understand.

I also tried b = a.withColumn("max_col", max([a["one"], a["two"], a["three"]])), which passes in the three columns as a list rather than 3 separate elements. This yields the same error as above.

Answer

Actually what you need here is greatest not max:

from pyspark.sql.functions import greatest

a.withColumn("max_col", greatest(a["one"], a["two"], a["three"]))

And just for completeness you can use least to find the minimum:

from pyspark.sql.functions import least

a.withColumn("min_col", least(a["one"], a["two"], a["three"]))

Regarding the error you see: it is quite simple. The Python builtin max depends on rich comparisons, and when you compare two columns you get a Column:

type(col("a") < col("b")
## pyspark.sql.column.Column

PySpark explicitly forbids converting columns to booleans (you can check Column.__nonzero__ source) because it is simply meaningless. It is only a logical expression which cannot be evaluated in the driver context.
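
To make the failure concrete, here is a minimal sketch of roughly what the builtin max does internally when handed Columns (assuming the a DataFrame from the question); the if test forces bool() on a Column, which is exactly where the ValueError above is raised:

largest = a["one"]
for item in [a["two"], a["three"]]:
    if item > largest:  # item > largest is a Column; bool() on it raises the ValueError
        largest = item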
