Show all pyspark columns after group and agg

Problem description

I wish to group by one column and then find the max of another column. Finally, I want to show all the columns that match this condition. However, when I use my code, it only shows 2 columns and not all of them.

# The usual way of creating a dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
    (2,2,'0-2'),
    (2,23,'22-24')],
    ['a', 'b', 'c']
)

sdataframe_temp2 = spark.createDataFrame([
    (4,6,'4-6'),
    (5,7,'6-8')],
    ['a', 'b', 'c']
)
# Concatenate the two pyspark dataframes
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)

sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})

sdataframe_union_1_2_g.show()

Output:

+---+------+
|  a|max(b)|
+---+------+
|  5|     7|
|  2|    23|
|  4|     6|
+---+------+

Expected output:

+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+
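
Note that groupby(...).agg(...) returns only the grouping key plus the aggregated columns, which is why c disappears. A quick check on the dataframes above (a sketch, not part of the original question) confirms it:

# The aggregated frame keeps only 'a' and the aggregate; 'c' is dropped by design
print(sdataframe_union_1_2.groupby('a').agg({'b': 'max'}).columns)
# ['a', 'max(b)']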

Solution

You can use a Window function to make it work:

Method 1: Using a Window function

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Partition by 'a' and sort each partition by 'b' descending,
# so the row holding the max of 'b' comes first in every group
w = Window().partitionBy("a").orderBy(F.desc("b"))

(sdataframe_union_1_2
.withColumn('max_val', F.row_number().over(w) == 1)  # True only for the first row per group
.where("max_val == True")
.drop("max_val")
.show())

+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  5|  7|  6-8|
|  2| 23|22-24|
|  4|  6|  4-6|
+---+---+-----+

Explanation

  1. Window functions are useful when we want to attach a new column to an existing set of columns.
  2. In this case, I tell the Window function to group by the partitionBy('a') column and sort column b in descending order with F.desc('b'). This makes the first value of b in each group its max value.
  3. Then we use F.row_number() to keep the max values by filtering for rows whose row number equals 1 (a tie-handling variant is sketched after this list).
  4. Finally, we drop the new column, since it is no longer needed after filtering the data frame.
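
One caveat the original answer does not cover: F.row_number() numbers tied rows differently, so if several rows share a group's max of b, only one of them is kept. A minimal sketch, assuming the same dataframes, that uses F.rank() instead (tied rows get the same rank) and therefore keeps all of them:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("a").orderBy(F.desc("b"))

# rank() assigns 1 to every row tied for the max of 'b', so ties survive the filter
(sdataframe_union_1_2
.withColumn('rk', F.rank().over(w))
.where("rk == 1")
.drop("rk")
.show())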

Method 2: Using groupby + inner join

# One row per group: the max of 'b', aliased back to 'b' so the join keys line up
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))

# Inner join on both 'a' and 'b' keeps only the original rows that hold each group's max
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()

+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  2| 23|22-24|
|  5|  7|  6-8|
|  4|  6|  4-6|
+---+---+-----+
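
For completeness, a variant not in the original answer: Spark orders struct columns field by field, so taking the max of a struct(b, c) carries c along with the winning b. A hedged sketch, assuming the same dataframes as above:

# max over a struct compares on 'b' first, so 'c' from the max row comes along
(sdataframe_union_1_2
.groupby('a')
.agg(F.max(F.struct('b', 'c')).alias('m'))
.select('a', 'm.b', 'm.c')
.show())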
