SAS Proc Freq with PySpark (Frequency, percent, cumulative frequency, and cumulative percent)
Question
I'm looking for a way to reproduce the SAS Proc Freq output in PySpark. I found this code that does exactly what I need; however, it is written in Pandas. I want to make sure the solution takes full advantage of what Spark can offer, as the code will run on massive datasets. In this other post (which was also adapted for this StackOverflow answer), I also found instructions for computing distributed groupwise cumulative sums in PySpark, but I'm not sure how to adapt them to my purpose.
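The linked pandas code is not reproduced in the question; as a point of reference, a minimal pandas sketch of a Proc Freq-style one-way frequency table might look like the following (the `proc_freq` helper name and column order are my own assumptions, not the linked code):

```python
import pandas as pd

# Illustrative sketch of a one-way Proc Freq table in pandas
# (helper name and layout are assumptions, not the linked code).
def proc_freq(df: pd.DataFrame, col: str) -> pd.DataFrame:
    # value_counts() already sorts by descending frequency
    out = df[col].value_counts().rename_axis(col).reset_index(name="Frequency")
    total = out["Frequency"].sum()
    out["Percent"] = (100 * out["Frequency"] / total).round(2)
    out["Cumulative Frequency"] = out["Frequency"].cumsum()
    out["Cumulative Percent"] = (100 * out["Cumulative Frequency"] / total).round(2)
    return out

sample = pd.DataFrame(
    {"state": ["West Virginia"] * 5 + ["Delaware"] * 3 + ["Indiana"] * 2}
)
print(proc_freq(sample, "state"))
```

This works in memory on a single machine, which is exactly what the question wants to avoid for billions of rows.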
Here's an input and output example (my original dataset will have a couple of billion rows):
Input dataset:
state
0 Delaware
1 Delaware
2 Delaware
3 Indiana
4 Indiana
... ...
1020 West Virginia
1021 West Virginia
1022 West Virginia
1023 West Virginia
1024 West Virginia
1025 rows × 1 columns
Expected output:
state Frequency Percent Cumulative Frequency Cumulative Percent
0 Vermont 246 24.00 246 24.00
1 New Hampshire 237 23.12 483 47.12
2 Missouri 115 11.22 598 58.34
3 North Carolina 100 9.76 698 68.10
4 Indiana 92 8.98 790 77.07
5 Montana 56 5.46 846 82.54
6 West Virginia 55 5.37 901 87.90
7 North Dakota 53 5.17 954 93.07
8 Washington 39 3.80 993 96.88
9 Utah 29 2.83 1022 99.71
10 Delaware 3 0.29 1025 100.00
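For reference, the arithmetic behind the expected output can be sketched in plain Python, using the state counts from the table above (note that Cumulative Percent is computed from the cumulative frequency, not by summing the rounded Percent column):

```python
# Proc Freq arithmetic, with the state counts taken from the
# expected-output table above.
counts = {
    "Vermont": 246, "New Hampshire": 237, "Missouri": 115,
    "North Carolina": 100, "Indiana": 92, "Montana": 56,
    "West Virginia": 55, "North Dakota": 53, "Washington": 39,
    "Utah": 29, "Delaware": 3,
}
total = sum(counts.values())  # 1025

rows, cum_freq = [], 0
for state, freq in sorted(counts.items(), key=lambda kv: -kv[1]):
    cum_freq += freq
    rows.append((
        state,
        freq,
        round(100 * freq / total, 2),       # Percent
        cum_freq,                           # Cumulative Frequency
        round(100 * cum_freq / total, 2),   # Cumulative Percent
    ))

for row in rows:
    print(row)
# first row:  ('Vermont', 246, 24.0, 246, 24.0)
# last row:   ('Delaware', 3, 0.29, 1025, 100.0)
```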
Recommended answer
You can first group by state to get the frequency and percent, then use sum over a window to get the cumulative frequency and percent. Note that the unpartitioned windows gather all rows onto a single partition, but since they run after the groupBy there is only one row per state, so this stays cheap even for huge inputs:
from pyspark.sql import functions as F

result = df.groupBy('state').agg(
    F.count('state').alias('Frequency')
).selectExpr(
    '*',
    '100 * Frequency / sum(Frequency) over() Percent'
).selectExpr(
    '*',
    'sum(Frequency) over(order by Frequency desc) Cumulative_Frequency',
    'sum(Percent) over(order by Frequency desc) Cumulative_Percent'
)
result.show()
+-------------+---------+-------+--------------------+------------------+
| state|Frequency|Percent|Cumulative_Frequency|Cumulative_Percent|
+-------------+---------+-------+--------------------+------------------+
|West Virginia| 5| 50.0| 5| 50.0|
| Delaware| 3| 30.0| 8| 80.0|
| Indiana| 2| 20.0| 10| 100.0|
+-------------+---------+-------+--------------------+------------------+