Comparing columns in PySpark
Question
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to choose the column with the maximum values in it.
For example:
Input: a PySpark DataFrame containing:
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:
col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas, as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark DataFrame to a pandas DataFrame and then perform the operations?
Recommended answer
You can reduce over a list of columns using SQL expressions:
from functools import reduce
from pyspark.sql.functions import col, when

def row_max(*cols):
    # Fold the columns pairwise, keeping the larger value at each step.
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

# Assumes `sc` is an existing SparkContext (as in the PySpark shell).
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
    .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
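To see what the fold inside row_max computes, here is a plain-Python model that needs no Spark session. It applies the same pairwise rule ("keep x if x > y, otherwise y") to the three lists from the question. The names pairwise_max and row_max_model are illustrative only and are not part of PySpark.

```python
from functools import reduce

def pairwise_max(x, y):
    # Same rule as when(x > y, x).otherwise(y), applied to plain numbers
    return x if x > y else y

def row_max_model(*columns):
    # zip(*columns) walks the columns row by row; reduce folds each row
    return [reduce(pairwise_max, row) for row in zip(*columns)]

col_1, col_2, col_3 = [1, 2, 3], [2, 1, 4], [3, 2, 5]
col_4 = row_max_model(col_1, col_2, col_3)
print(col_4)  # [3, 2, 5]
```

In Spark the same reduce call does not loop over rows in Python; it builds one nested when/otherwise column expression that Spark then evaluates per row.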
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
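One difference worth checking on your own data is null handling. The PySpark documentation describes greatest as skipping null values, whereas in a when/otherwise chain a comparison involving null is not true, so the fold falls through to otherwise(y). The sketch below is a plain-Python model of those semantics, with None standing in for null; it is not Spark code, and the helper names are illustrative.

```python
from functools import reduce

def sql_gt(x, y):
    # Model of SQL `x > y`: the result is unknown (None) if either side is null
    if x is None or y is None:
        return None
    return x > y

def when_otherwise_max(values):
    # Mirrors when(x > y, x).otherwise(y): keep x only when x > y is true
    return reduce(lambda x, y: x if sql_gt(x, y) is True else y, values)

def greatest_model(values):
    # Mirrors the documented behaviour of greatest: nulls are skipped,
    # and the result is null only if every input is null
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None

print(when_otherwise_max([1, 2, None]))  # None
print(greatest_model([1, 2, None]))      # 2
```

Under this model the when/otherwise fold depends on where the null sits in the column list, so greatest is the more predictable choice when the columns may contain nulls.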
If you want to keep the name of the column that holds the max, you can use structs:
from pyspark.sql.functions import col, greatest, lit, struct

def row_max_with_name(*cols):
    # Pair each value with its column name so the name travels with the max.
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))

maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
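Structs are compared field by field in order, so greatest over (value, col) structs picks the largest value and carries its column name along. Python tuples compare the same way, which gives a quick model of what each row of maxs should hold for the sample data. This is a sketch in plain Python, and row_max_with_name_model is an illustrative name.

```python
names = ["a", "b", "c"]
rows = [(1, 2, 3), (2, 1, 2), (3, 4, 5)]  # same data as df above

def row_max_with_name_model(row, names):
    # max over (value, name) tuples: the value decides, the name breaks ties
    return max(zip(row, names))

maxs_model = [row_max_with_name_model(r, names) for r in rows]
print(maxs_model)  # [(3, 'c'), (2, 'c'), (5, 'c')]
```

In the second row, a and c tie at 2. Because the name is the second field, the tie goes to the name that sorts last, which is worth keeping in mind if ties matter for your data.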
Finally, you can use the above to select the "top" column:
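from pyspark.sql.functions import col, struct, max as max_

# Count how often each column wins, then take the pair with the highest count.
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max_(struct(col("count"), col("col"))))
    .first())

df.select(c)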
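The aggregation above counts how many rows each column wins and then takes the largest (count, col) pair. The same steps can be modelled in plain Python on the sample data; this is a sketch of the logic rather than Spark code, and the variable names are illustrative.

```python
from collections import Counter

names = ["a", "b", "c"]
rows = [(1, 2, 3), (2, 1, 2), (3, 4, 5)]  # same data as df above

# Per-row winner: the name from the largest (value, name) tuple
winners = [max(zip(row, names))[1] for row in rows]

# Equivalent of groupBy("col").count(), then max over (count, col) pairs
counts = Counter(winners)
top_count, top_col = max((n, name) for name, n in counts.items())
print(top_col, top_count)  # c 3
```

Putting the count first in the pair is what makes the maximum select the most frequent winner; the column name only breaks ties between equal counts.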