如何计算差异的火花数据帧1山坳？ [英] how to compute diff for one col in spark dataframe?

查看：175 发布时间：2016/5/22 16:43:48 datetime apache-spark apache-spark-sql spark-dataframe

本文介绍了如何计算差异的火花数据帧1山坳？的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

+-------------------+
|           Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+

有关火花的数据帧，我想计算日期时间的差异，就像在numpy.diff（阵列）

For spark's dataframe, I want to compute the diff of the datetime ,just like in numpy.diff(array)

推荐答案

一般来说，没有有效的方式来实现这一目标用星火 DataFrames 。且不说之类的东西才能在一个分布式安装变得相当棘手。从理论上讲，你可以使用滞后的功能如下：

Generally speaking there is no efficient way to achieve this using Spark DataFrames. Not to mention things like order become quite tricky in a distributed setup. Theoretically you can use lag function as follows:

from pyspark.sql.functions import lag, col, unix_timestamp
from pyspark.sql.window import Window

dev_time = (unix_timestamp(col("dev_time")) * 1000).cast("timestamp")

df = sc.parallelize([
    ("2015-09-18 05:00:20", ), ("2015-09-18 05:00:21", ),
    ("2015-09-18 05:00:22", ), ("2015-09-18 05:00:23", ),
    ("2015-09-18 05:00:24", ), ("2015-09-18 05:00:25", ),
    ("2015-09-18 05:00:26", ), ("2015-09-18 05:00:27", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:37", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:38", ),
    ("2015-09-18 05:00:39", )
]).toDF(["dev_time"]).withColumn("dev_time", dev_time)

w = Window.orderBy("dev_time")
lag_dev_time = lag("dev_time").over(w).cast("integer")

diff = df.select((col("dev_time").cast("integer") - lag_dev_time).alias("diff"))

## diff.show()
## +----+
## |diff|
## +----+
## |null|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |   1|
## |  10|
## ...

但它是非常低效的（作为窗口功能将所有的数据到一个分区，如果没有提供 PARTITION BY 条款）。在实践中，让使用更感滑动方法上RDD（斯卡拉），或实现自己的滑动窗口（蟒蛇）。参见：

but it is extremely inefficient (as for window functions move all data to a single partition if no PARTITION BY clause is provided). In practice it makes more sense to use sliding method on a RDD (Scala) or implement your own sliding window (Python). See:

如何在Pyspark滑动窗口在时间序列数据转换数据

如何基于索引访问星火元素的数组RDD

How to transform data with sliding window over time series data in Pyspark
How to access Spark RDD Array of elements based on index

这篇关于如何计算差异的火花数据帧1山坳？的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何计算差异的火花数据帧1山坳？ [英] how to compute diff for one col in spark dataframe?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何计算差异的火花数据帧1山坳？ [英] how to compute diff for one col in spark dataframe?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭