How to calculate cumulative sum using sqlContext


Question

I know we can use the Window function in pyspark to calculate a cumulative sum, but Window is only supported in HiveContext and not in SQLContext. I need to use SQLContext because HiveContext cannot be run in multiple processes.

Is there an efficient way to calculate a cumulative sum using SQLContext? A simple way is to load the data into the driver's memory and use numpy.cumsum, but the downside is that the data needs to fit into memory.
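For reference, a minimal sketch of that driver-side approach (the DataFrame df and its revenue column are illustrative assumptions); everything is collected into driver memory first, which is exactly the limitation mentioned above:

import numpy as np

# Pull the column into the driver as a plain Python list (must fit in memory)
values = [row.revenue for row in df.select("revenue").orderBy("revenue").collect()]

# Compute the running total locally with numpy
cumulative = np.cumsum(values)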

Answer

Not sure if this is what you are looking for, but here are two examples of how to use sqlContext to calculate the cumulative sum:

First, when you want to partition it by some categories:

from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SQLContext

rdd = sc.parallelize([
    ("Tablet", 6500), 
    ("Tablet", 5500), 
    ("Cell Phone", 6000), 
    ("Cell Phone", 6500), 
    ("Cell Phone", 5500)
    ])

schema = StructType([
    StructField("category", StringType(), False),
    StructField("revenue", LongType(), False)
    ])

df = sqlContext.createDataFrame(rdd, schema)

df.registerTempTable("test_table")

df2 = sqlContext.sql("""
SELECT
    category,
    revenue,
    sum(revenue) OVER (PARTITION BY category ORDER BY revenue) as cumsum
FROM
    test_table
""")

Output:

[Row(category='Tablet', revenue=5500, cumsum=5500),
 Row(category='Tablet', revenue=6500, cumsum=12000),
 Row(category='Cell Phone', revenue=5500, cumsum=5500),
 Row(category='Cell Phone', revenue=6000, cumsum=11500),
 Row(category='Cell Phone', revenue=6500, cumsum=18000)]

Second, when you only want to take the cumulative sum of one variable, change df2 to this:

df2 = sqlContext.sql("""
SELECT
    category,
    revenue,
    sum(revenue) OVER (ORDER BY revenue, category) as cumsum
FROM
    test_table
""")

Output:

[Row(category='Cell Phone', revenue=5500, cumsum=5500),
 Row(category='Tablet', revenue=5500, cumsum=11000),
 Row(category='Cell Phone', revenue=6000, cumsum=17000),
 Row(category='Cell Phone', revenue=6500, cumsum=23500),
 Row(category='Tablet', revenue=6500, cumsum=30000)]

Hope this helps. Using np.cumsum after collecting the data is not very efficient, especially if the dataset is large. Another approach you could explore is to use simple RDD transformations such as groupByKey(), then use map to calculate the cumulative sum within each group by some key, and reduce at the end.
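A rough sketch of that RDD-based alternative, reusing the rdd of (category, revenue) pairs from the example above (the per-group sort and the use of flatMap plus a final collect() are illustrative choices, not spelled out in the original answer); note that each group's values are materialized on a single executor, so every group must fit in memory:

def cumsum_group(pair):
    # pair is (category, iterable of revenues) produced by groupByKey()
    category, revenues = pair
    running = 0
    out = []
    for r in sorted(revenues):  # order within the group by revenue
        running += r
        out.append((category, r, running))
    return out

result = (rdd
          .groupByKey()           # group revenues by category
          .flatMap(cumsum_group)  # cumulative sum within each group
          .collect())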

