Show all pyspark columns after group and agg
Problem Description
I wish to group by one column and then find the max of another column. Lastly, I want to show all the columns based on this condition. However, when I use my code, it only shows 2 columns and not all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
(2,2,'0-2'),
(2,23,'22-24')],
['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
(4,6,'4-6'),
(5,7,'6-8')],
['a', 'b', 'c']
)
# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
Output:
+---+------+
|  a|max(b)|
+---+------+
|  5|     7|
|  2|    23|
|  4|     6|
+---+------+
Expected output:
+---+------+-----+
|  a|max(b)|    c|
+---+------+-----+
|  5|     7|  6-8|
|  2|    23|22-24|
|  4|     6|  4-6|
+---+------+-----+
Solution
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("a").orderBy(F.desc("b"))
(sdataframe_union_1_2
.withColumn('max_val', F.row_number().over(w) == 1)
.where("max_val == True")
.drop("max_val")
.show())
+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  5|  7|  6-8|
|  2| 23|22-24|
|  4|  6|  4-6|
+---+---+-----+
Explanation
- Window functions are useful when we want to attach a new column to the existing set of columns.
- In this case, I tell the Window function to partition by column a with partitionBy('a') and sort column b in descending order with F.desc("b"). This makes the first value of b in each group its max value.
- Then we use F.row_number() to keep only the max values, i.e. the rows where the row number equals 1.
- Finally, we drop the new column since it is not used after filtering the data frame.
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
|  a|  b|    c|
+---+---+-----+
|  2| 23|22-24|
|  5|  7|  6-8|
|  4|  6|  4-6|
+---+---+-----+