Comparing columns in PySpark

Views: 31  Published: 2021/11/12 5:41:35  Tags: python apache-spark pyspark

Question

I am working with a PySpark DataFrame that has n columns. I have a set of m columns (m < n), and my task is to choose the column with the maximum values in it.

For example:

Input: a PySpark DataFrame containing:

col_1 = [1,2,3], col_2 = [2,1,4], col_3 = [3,2,5]

Output:

col_4 = max(col_1, col_2, col_3) = [3,2,5]

pandas has something similar, as explained in this question.

Is there a way to do this in PySpark, or should I convert my PySpark DataFrame to a pandas DataFrame and then perform the operation?

Recommended answer

You can use reduce with SQL expressions over a list of columns:

from functools import reduce
from pyspark.sql.functions import col, when

def row_max(*cols):
    # Fold the columns pairwise, keeping the larger value at each step.
    return reduce(
        lambda x, y: when(x > y, x).otherwise(y),
        [col(c) if isinstance(c, str) else c for c in cols]
    )

# Assumes an active SparkContext named sc.
df = (sc.parallelize([(1, 2, 3), (2, 1, 2), (3, 4, 5)])
    .toDF(["a", "b", "c"]))

df.select(row_max("a", "b", "c").alias("max"))
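
To make the fold easier to follow, here is a minimal plain-Python sketch of the same pairwise reduction, run on ordinary tuples instead of Spark columns. The name row_max_plain is hypothetical and only for illustration. It mirrors when(x > y, x).otherwise(y) for non-null values and makes no attempt to model Spark's null handling.

```python
from functools import reduce

def row_max_plain(*values):
    # Pairwise fold: keep x when x > y, otherwise keep y.
    return reduce(lambda x, y: x if x > y else y, values)

# Same sample rows as the DataFrame above.
rows = [(1, 2, 3), (2, 1, 2), (3, 4, 5)]
maxima = [row_max_plain(*row) for row in rows]
print(maxima)  # [3, 2, 5]
```

In the Spark version, the same fold builds one nested column expression rather than computing values eagerly, so Spark evaluates it per row when the query runs.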

Spark 1.5+ also provides least and greatest:

from pyspark.sql.functions import greatest

df.select(greatest("a", "b", "c"))

If you want to keep the name of the column holding the maximum, you can use structs:

from pyspark.sql.functions import col, greatest, lit, struct

def row_max_with_name(*cols):
    # Pair each value with its column name; the value is the first struct field.
    cols_ = [struct(col(c).alias("value"), lit(c).alias("col")) for c in cols]
    return greatest(*cols_).alias("greatest({0})".format(",".join(cols)))

maxs = df.select(row_max_with_name("a", "b", "c").alias("maxs"))
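
The struct approach relies on ordering: as far as I understand, Spark compares structs field by field, so the value decides first and the column name only breaks ties. Python tuples compare the same way, so a rough plain-Python analogue looks like the sketch below. The helper row_max_with_name_plain and the dictionary rows are hypothetical stand-ins for the DataFrame.

```python
# Hypothetical stand-in for the DataFrame rows used in the answer.
rows = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 2, "b": 1, "c": 2},
    {"a": 3, "b": 4, "c": 5},
]

def row_max_with_name_plain(row, cols):
    # Build (value, column_name) pairs; max() compares the value first
    # and falls back to the name only when values tie.
    return max((row[c], c) for c in cols)

maxs_plain = [row_max_with_name_plain(r, ["a", "b", "c"]) for r in rows]
print(maxs_plain)  # [(3, 'c'), (2, 'c'), (5, 'c')]
```

In the second row, columns a and c both hold 2, so the tie goes to the name that sorts last, which is c.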

Finally, you can use the above to select the "top" column:

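from pyspark.sql.functions import col, max as max_, struct

# Count how often each column holds the row maximum, then take the most frequent one.
((_, c), ) = (maxs
    .groupBy(col("maxs")["col"].alias("col"))
    .count()
    .agg(max_(struct(col("count"), col("col"))))
    .first())

df.select(c)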
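
The aggregation above amounts to "count the winners per column, then take the largest (count, col) pair". A minimal plain-Python sketch of that logic follows, with a hypothetical list of per-row winning column names standing in for the "col" field of maxs.

```python
from collections import Counter

# Hypothetical per-row winners, one column name per row.
winners = ["c", "a", "c", "b", "a"]

# Equivalent of groupBy("col").count().
counts = Counter(winners)

# Max over (count, col) pairs: the highest count wins, and a tie in
# count goes to the column name that sorts last.
top_count, top_col = max((n, c) for c, n in counts.items())
print(top_col, top_count)  # c 2
```

Putting the count first in the pair is what makes the count the primary sort key, which is the same reason the Spark code builds struct(count, col) rather than struct(col, count).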
