Comparing columns in Pyspark

Question
I am working on a PySpark DataFrame with n columns. I have a set of m columns (m < n), and my task is to choose the column with the max values in it.
For example:

Input: a PySpark DataFrame containing:
col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]
Output:

col_4 = max(col_1, col_2, col_3) = [3,2,5]
There is something similar in pandas, as explained in this question.
Is there any way of doing this in PySpark, or should I convert my PySpark df to a Pandas df and then perform the operations?
Answer

You can reduce over a list of columns using SQL expressions:
from functools import reduce
from pyspark.sql.functions import col, when

def row_max(*cols):
    # Fold the columns pairwise, keeping the larger value at each step.
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
    .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
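As a quick sanity check (my addition, not part of the original answer), running this against the toy DataFrame above should give roughly:

df.select("*", row_max("a", "b", "c").alias("max")).show()
# +---+---+---+---+
# |  a|  b|  c|max|
# +---+---+---+---+
# |  1|  2|  3|  3|
# |  2|  1|  2|  2|
# |  3|  4|  5|  5|
# +---+---+---+---+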
Spark 1.5+ also provides least and greatest:
from pyspark.sql.functions import greatest
df.select(greatest("a", "b", "c"))
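One difference worth noting (a small illustrative sketch, not from the original answer): greatest skips null values and returns null only if all inputs are null, whereas the when-based row_max above lets a null comparison fall through to otherwise:

from pyspark.sql.functions import greatest, lit, col

# greatest ignores the null literal here and still returns
# the larger of columns a and c for each row.
df.select(greatest(col("a"), lit(None).cast("int"), col("c")).alias("m"))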
If you want to keep the name of the max column, you can use structs:
from pyspark.sql.functions import struct, lit

def row_max_with_name(*cols):
    # Pair each value with its column name; structs compare field by
    # field, so greatest orders by value first, then by column name.
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))
maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
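Since maxs now holds a struct column, the per-row value and the winning column name can be unpacked with field access (a short sketch following the same pattern as the final snippet; the aliases are illustrative):

maxs.select(
    col("maxs")["value"].alias("max_value"),
    col("maxs")["col"].alias("max_col")
).show()
# +---------+-------+
# |max_value|max_col|
# +---------+-------+
# |        3|      c|
# |        2|      c|
# |        5|      c|
# +---------+-------+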
And finally, you can use the above to select the "top" column:
from pyspark.sql.functions import max

((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    # Packing (count, col) into a struct lets max pick the most
    # frequent column, breaking ties by column name.
    .agg(max(struct(col("count"), col("col"))))
    .first())
df.select(c)
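Here max is the aggregate pyspark.sql.functions.max, shadowing Python's builtin. On the toy DataFrame, column c holds the row-wise maximum in every row, so c resolves to "c" and the final select returns that column.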