how to compute diff for one col in spark dataframe?


Problem description

+-------------------+
|           Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+


For a Spark DataFrame, I want to compute the diff of the datetime column, just like numpy.diff(array).

Recommended answer


Generally speaking, there is no efficient way to achieve this using Spark DataFrames, and things like ordering become quite tricky in a distributed setup. In theory you can use the lag function as follows:

from pyspark.sql.functions import lag, col, unix_timestamp
from pyspark.sql.window import Window

# Parse the string column into a proper timestamp
dev_time = (unix_timestamp(col("dev_time")) * 1000).cast("timestamp")

df = sc.parallelize([
    ("2015-09-18 05:00:20", ), ("2015-09-18 05:00:21", ),
    ("2015-09-18 05:00:22", ), ("2015-09-18 05:00:23", ),
    ("2015-09-18 05:00:24", ), ("2015-09-18 05:00:25", ),
    ("2015-09-18 05:00:26", ), ("2015-09-18 05:00:27", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:37", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:38", ),
    ("2015-09-18 05:00:39", )
]).toDF(["dev_time"]).withColumn("dev_time", dev_time)

# Window over the whole dataset, ordered by dev_time (note: no PARTITION BY)
w = Window.orderBy("dev_time")
lag_dev_time = lag("dev_time").over(w).cast("integer")

# Difference in seconds between each row and the previous one
diff = df.select((col("dev_time").cast("integer") - lag_dev_time).alias("diff"))

## diff.show()
## +----+
## |diff|
## +----+
## |null|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |  10|
## ...


but it is extremely inefficient: if no PARTITION BY clause is provided, a window function moves all the data to a single partition. In practice it makes more sense to use the sliding method on an RDD (Scala) or to implement your own sliding window (Python).

