Creating binned histograms in Spark


Question

Suppose I have a dataframe (df) (Pandas) or an RDD (Spark) with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pandas, I can create a binned histogram with different bin lengths pretty easily. For example, to create a histogram over 1 hour, I do the following:

df = df[['timestamp', 'data']].set_index('timestamp')
df.resample('1H').sum().dropna()  # note: resample(..., how=sum) was removed in newer pandas

Moving from a Spark RDD to a Pandas df is pretty expensive for me (considering the size of the dataset). Consequently, I would prefer to stay within the Spark domain as much as possible.

Is there a way to do the equivalent in Spark RDDs or DataFrames?

Answer

In this particular case, all you need is Unix timestamps and basic arithmetic:

from pyspark.sql.functions import floor, unix_timestamp

def resample_to_minute(c, interval=1):
    # Snap each timestamp (in seconds) down to the start of its bucket.
    t = 60 * interval
    return (floor(c / t) * t).cast("timestamp")

def resample_to_hour(c, interval=1):
    return resample_to_minute(c, 60 * interval)

df = sc.parallelize([
    ("2000-01-01 00:00:00", 0), ("2000-01-01 00:01:00", 1),
    ("2000-01-01 00:02:00", 2), ("2000-01-01 00:03:00", 3),
    ("2000-01-01 00:04:00", 4), ("2000-01-01 00:05:00", 5),
    ("2000-01-01 00:06:00", 6), ("2000-01-01 00:07:00", 7),
    ("2000-01-01 00:08:00", 8)
]).toDF(["timestamp", "data"])

(df.groupBy(resample_to_minute(unix_timestamp("timestamp"), 3).alias("ts"))
    .sum().orderBy("ts").show(3, False))

## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|3        |
## |2000-01-01 00:03:00.0|12       |
## |2000-01-01 00:06:00.0|21       |
## +---------------------+---------+

(df.groupBy(resample_to_hour(unix_timestamp("timestamp")).alias("ts"))
    .sum().orderBy("ts").show(3, False))
## +---------------------+---------+
## |ts                   |sum(data)|
## +---------------------+---------+
## |2000-01-01 00:00:00.0|36       |
## +---------------------+---------+
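The bucketing arithmetic itself is independent of Spark and easy to verify on its own. A minimal pure-Python sketch of the same floor(ts / t) * t idea (the epoch value and helper here are illustrative, not part of the answer above):

```python
from math import floor

def resample_to_minute(ts, interval=1):
    """Snap a Unix timestamp (seconds) down to the start of its
    `interval`-minute bucket -- the same arithmetic as the Spark
    version, minus the cast back to a timestamp column."""
    t = 60 * interval
    return floor(ts / t) * t

# Nine rows one minute apart, mirroring the example DataFrame.
base = 946684800  # 2000-01-01 00:00:00 UTC
data = [(base + 60 * i, i) for i in range(9)]

# Group-and-sum by 3-minute bucket, like the groupBy above.
sums = {}
for ts, value in data:
    bucket = resample_to_minute(ts, 3)
    sums[bucket] = sums.get(bucket, 0) + value

print(sorted(sums.values()))  # -> [3, 12, 21], matching the Spark output
```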

From the documentation.

For the general case, see Making histogram with Spark DataFrame column.
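For plain value histograms (counts per bin rather than sums), PySpark's RDD API also provides `RDD.histogram(buckets)`. Its bucket semantics (half-open bins, with the last bin closed) can be sketched in pure Python; this is an illustration of the semantics, not the actual implementation:

```python
def histogram(values, buckets):
    """Count values into len(buckets) - 1 bins defined by sorted bucket
    edges, mimicking RDD.histogram(buckets): each bin is half-open
    [lo, hi), except the last, which also includes its upper edge."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # outside the overall range; ignored
        for i in range(len(counts)):
            last = i == len(counts) - 1
            if buckets[i] <= v < buckets[i + 1] or (last and v == buckets[-1]):
                counts[i] += 1
                break
    return counts

print(histogram([0, 1, 2, 3, 4, 5, 10], [0, 2, 4, 10]))  # -> [2, 2, 3]
```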

