PySpark multi-groupby with different columns
Question
I have data as below:
year name percent sex
1880 John 0.081541 boy
1881 William 0.080511 boy
1881 John 0.050057 boy
I need to group by and count using different columns:
df_year = df.groupby('year').count()
df_name = df.groupby('name').count()
df_sex = df.groupby('sex').count()
Then I have to create a window to get the top-3 rows by count for each column:
from pyspark.sql import Window
from pyspark.sql import functions as func
from pyspark.sql.functions import col

window = Window.partitionBy('year').orderBy(col("count").desc())
top4_res = (df_year.withColumn('topn', func.row_number().over(window))
            .filter(col('topn') <= 4).repartition(1))
Suppose I have hundreds of columns to group by, count, and take the top-3 of.
Can I do it all in one pass, or is there a better way to do it?
Answer
I am not sure if this will meet your requirement, but if you are okay with a single dataframe, I think it can give you a start; let me know otherwise. You can stack these 3 columns (or more) and then group by and take the count:
from pyspark.sql import functions as F

cols = ['year', 'name', 'sex']

# Builds: stack(3,"year",year,"name",name,"sex",sex) as (col,val)
e = f"""stack({len(cols)},{','.join(map(','.join,
     zip([f'"{i}"' for i in cols], cols)))}) as (col,val)"""

(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
   .groupBy('col', 'val').agg(F.count("col").alias("Counts"))
   .orderBy('col').show())
+----+-------+------+
| col| val|Counts|
+----+-------+------+
|name| John| 2|
|name|William| 1|
| sex| boy| 3|
|year| 1881| 2|
|year| 1880| 1|
+----+-------+------+
If you want a wide form you can also pivot, but I think the long form will be more helpful:
(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
.groupBy('col').pivot('val').agg(F.count('val')).show())
+----+----+----+----+-------+----+
| col|1880|1881|John|William| boy|
+----+----+----+----+-------+----+
|name|null|null| 2| 1|null|
|year| 1| 2|null| null|null|
| sex|null|null|null| null| 3|
+----+----+----+----+-------+----+