Iterating over PySpark GroupedData

Problem Description

假设原始数据如下:

Competitor  Region  ProductA  ProductB
Comp1       A       £10       £15
Comp1       B       £11       £16
Comp1       C       £11       £15
Comp2       A       £9        £16
Comp2       B       £12       £14
Comp2       C       £14       £17
Comp3       A       £11       £16
Comp3       B       £10       £15
Comp3       C       £12       £15

(Reference: Python - Splitting a dataframe into multiple dataframes based on column values and naming them with those values)

I wish to get a list of sub-dataframes based on column values, say Region, like:

df_A :

Competitor  Region  ProductA  ProductB
Comp1       A       £10       £15
Comp2       A       £9        £16
Comp3       A       £11       £16

In Python I could do:

for region, df_region in df.groupby('Region'):
    print(df_region)

Can I do the same iteration if the df is a PySpark df?

In PySpark, once I do df.groupBy("Region") I get GroupedData. I don't need any aggregation like count, mean, etc. I just need a list of sub-dataframes, each having the same "Region" value. Is that possible?

Recommended Answer

The approach below should work for you, under the assumption that the list of unique values in the grouping column is small enough to fit in memory on the driver. Hope this helps!

import pyspark.sql.functions as F
import pandas as pd

# Sample data (assumes an active SparkSession is available as `spark`,
# as in spark-shell / pyspark or a notebook session)
df = pd.DataFrame({'region': ['aa','aa','aa','bb','bb','cc'],
                   'x2': [6,5,4,3,2,1],
                   'x3': [1,2,3,4,5,6]})
df = spark.createDataFrame(df)

# Get the unique values in the grouping column
groups = [x[0] for x in df.select("region").distinct().collect()]

# Create a filtered DataFrame for each group in a list comprehension
groups_list = [df.filter(F.col('region') == x) for x in groups]

# Show the results
[x.show() for x in groups_list]

Result:

+------+---+---+
|region| x2| x3|
+------+---+---+
|    cc|  1|  6|
+------+---+---+

+------+---+---+
|region| x2| x3|
+------+---+---+
|    bb|  3|  4|
|    bb|  2|  5|
+------+---+---+

+------+---+---+
|region| x2| x3|
+------+---+---+
|    aa|  6|  1|
|    aa|  5|  2|
|    aa|  4|  3|
+------+---+---+
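If you also want the `(value, DataFrame)` pairs that pandas' groupby iteration yields, you can zip the collected values with the filtered DataFrames. A minimal sketch building on the `groups` and `groups_list` variables above (the `region_dfs` name is just illustrative):

# Pair each region value with its sub-DataFrame, mirroring pandas'
# `for region, df_region in df.groupby('Region')` iteration.
region_dfs = dict(zip(groups, groups_list))

for region, df_region in region_dfs.items():
    print(region)
    df_region.show()

Note that each filtered DataFrame re-reads the source data whenever an action such as show() runs, so if there are many distinct values it can help to call df.cache() before building the list.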
