SAS Proc Freq with PySpark (Frequency, percent, cumulative frequency, and cumulative percent)


Question

I'm looking for a way to reproduce the SAS Proc Freq code in PySpark. I found this code that does exactly what I need, but it is given in Pandas. I want to make sure the solution uses the best of what Spark has to offer, as the code will run on massive datasets. In this other post (which was also adapted into this StackOverflow answer), I also found instructions for computing distributed groupwise cumulative sums in PySpark, but I'm not sure how to adapt them to my purpose.
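
For context, the grouped cumulative-sum pattern described in those posts looks roughly like the sketch below; the DataFrame and column names here are hypothetical placeholders, not part of the original question.

from pyspark.sql import functions as F, Window

# Cumulative sum within each group; partitioning by the group key keeps the
# computation distributed instead of pulling everything into one partition.
w = (Window.partitionBy('group_col')
     .orderBy('order_col')
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df_cum = df.withColumn('cum_value', F.sum('value_col').over(w))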

Here's an input and output example (my original dataset will have a couple of billion rows):

Input dataset:

        state
0       Delaware
1       Delaware
2       Delaware
3       Indiana
4       Indiana
...     ...
1020    West Virginia
1021    West Virginia
1022    West Virginia
1023    West Virginia
1024    West Virginia

1025 rows × 1 columns

Expected output:

    state           Frequency   Percent Cumulative Frequency    Cumulative Percent
0   Vermont         246         24.00   246                     24.00
1   New Hampshire   237         23.12   483                     47.12
2   Missouri        115         11.22   598                     58.34
3   North Carolina  100         9.76    698                     68.10
4   Indiana         92          8.98    790                     77.07
5   Montana         56          5.46    846                     82.54
6   West Virginia   55          5.37    901                     87.90
7   North Dakota    53          5.17    954                     93.07
8   Washington      39          3.80    993                     96.88
9   Utah            29          2.83    1022                    99.71
10  Delaware        3           0.29    1025                    100.00

Answer

You can first group by state to get the frequency and percent, then use sum over a window to get the cumulative frequency and percent:
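
The snippet below assumes an active SparkSession and a DataFrame df with a single state column. A minimal sketch of that setup, with the sample values inferred from the show() output further down, might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the output shown below: 5 West Virginia, 3 Delaware, 2 Indiana
states = ['West Virginia'] * 5 + ['Delaware'] * 3 + ['Indiana'] * 2
df = spark.createDataFrame([(s,) for s in states], ['state'])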

from pyspark.sql import functions as F

# Frequency: number of rows per state
result = df.groupBy('state').agg(
    F.count('state').alias('Frequency')
).selectExpr(
    '*',
    # Percent: each state's share of the total row count
    '100 * Frequency / sum(Frequency) over() Percent'
).selectExpr(
    '*',
    # Running totals over states ordered by descending frequency
    'sum(Frequency) over(order by Frequency desc) Cumulative_Frequency',
    'sum(Percent) over(order by Frequency desc) Cumulative_Percent'
)

result.show()
+-------------+---------+-------+--------------------+------------------+
|        state|Frequency|Percent|Cumulative_Frequency|Cumulative_Percent|
+-------------+---------+-------+--------------------+------------------+
|West Virginia|        5|   50.0|                   5|              50.0|
|     Delaware|        3|   30.0|                   8|              80.0|
|      Indiana|        2|   20.0|                  10|             100.0|
+-------------+---------+-------+--------------------+------------------+

