GroupBy column and filter rows with maximum value in Pyspark

Question

I am almost certain this has been asked before, but a search through stackoverflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really simple: I want to groupBy column "A" and then only keep the row of each group that has the maximum value in column "B". Like this:

import pyspark.sql.functions as F

df_cleaned = df.groupBy("A").agg(F.max("B"))

Unfortunately, this throws away all other columns - df_cleaned only contains column "A" and the max value of "B". How do I instead keep the full rows ("A", "B", "C"...)?
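
For concreteness, here is a hypothetical before/after (column "C" stands in for the extra columns to keep):

# input              desired result
#+---+---+---+       #+---+---+---+
#|  A|  B|  C|       #|  A|  B|  C|
#+---+---+---+       #+---+---+---+
#|  a|  5|  x|       #|  a|  8|  y|
#|  a|  8|  y|  ->   #|  b|  3|  z|
#|  a|  7|  w|       #+---+---+---+
#|  b|  1|  v|
#|  b|  3|  z|
#+---+---+---+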

Answer

You can do this without a udf using a Window.

Consider the following example:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
    ('b', 3)
]
df = spark.createDataFrame(data, ["A", "B"])
df.show()
df.show()
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  5|
#|  a|  8|
#|  a|  7|
#|  b|  1|
#|  b|  3|
#+---+---+

Create a Window partitioned by column A and use it to compute the maximum of each group. Then keep only the rows where the value in column B equals that maximum.

from pyspark.sql import Window

w = Window.partitionBy('A')
(df.withColumn('maxB', f.max('B').over(w))
    .where(f.col('B') == f.col('maxB'))
    .drop('maxB')
    .show())
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+
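
Note that this keeps every row that ties for the group maximum. If you want exactly one row per group even when there are ties, a row_number variant works; this is a sketch, with ties broken arbitrarily beyond the descending sort on B:

from pyspark.sql import Window

# Order each group by B descending and keep only the first row per group.
w_ord = Window.partitionBy('A').orderBy(f.col('B').desc())
(df.withColumn('rn', f.row_number().over(w_ord))
    .where(f.col('rn') == 1)
    .drop('rn')
    .show())
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+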

Or, equivalently, using pyspark-sql:

df.createOrReplaceTempView('table')
q = "SELECT A, B FROM (SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M WHERE B = maxB"
spark.sql(q).show()
#+---+---+
#|  A|  B|
#+---+---+
#|  b|  3|
#|  a|  8|
#+---+---+
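
Another common pattern (a sketch, not part of the original answer) is to aggregate the per-group maximum and join it back; like the window version, it keeps all tied rows:

import pyspark.sql.functions as f

# Compute the per-group max of B, then inner-join on (A, B) to recover the full rows.
max_b = df.groupBy('A').agg(f.max('B').alias('B'))
df.join(max_b, on=['A', 'B'], how='inner').show()
#+---+---+
#|  A|  B|
#+---+---+
#|  a|  8|
#|  b|  3|
#+---+---+
# (row order may vary)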
