How to compute diff for one column in a Spark DataFrame?
Question
+-------------------+
| Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+
For a Spark DataFrame, I want to compute the diff of a datetime column, just like numpy.diff(array) does.
Answer
Generally speaking there is no efficient way to achieve this using Spark DataFrames, not to mention that things like order become quite tricky in a distributed setup. Theoretically you can use the lag function as follows:
from pyspark.sql.functions import lag, col, unix_timestamp
from pyspark.sql.window import Window

# unix_timestamp parses the string into seconds since the epoch;
# casting that number to timestamp interprets it as seconds
dev_time = unix_timestamp(col("dev_time")).cast("timestamp")

df = sc.parallelize([
    ("2015-09-18 05:00:20", ), ("2015-09-18 05:00:21", ),
    ("2015-09-18 05:00:22", ), ("2015-09-18 05:00:23", ),
    ("2015-09-18 05:00:24", ), ("2015-09-18 05:00:25", ),
    ("2015-09-18 05:00:26", ), ("2015-09-18 05:00:27", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:37", ),
    ("2015-09-18 05:00:37", ), ("2015-09-18 05:00:38", ),
    ("2015-09-18 05:00:39", )
]).toDF(["dev_time"]).withColumn("dev_time", dev_time)

# A window over the whole ordered column -- note there is no PARTITION BY
w = Window.orderBy("dev_time")

# The previous row's value, as seconds since the epoch
lag_dev_time = lag("dev_time").over(w).cast("integer")

diff = df.select((col("dev_time").cast("integer") - lag_dev_time).alias("diff"))
## diff.show()
## +----+
## |diff|
## +----+
## |null|
## | 1|
## | 1|
## | 1|
## | 1|
## | 1|
## | 1|
## | 1|
## | 10|
## ...
but it is extremely inefficient (window functions move all data to a single partition if no PARTITION BY clause is provided). In practice it makes more sense to use the sliding method on an RDD (Scala) or implement your own sliding window (Python). See:
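A hand-rolled sliding window can be sketched as a plain generator that yields pairwise differences; the same function can then be passed to an RDD's mapPartitions so that no single task has to hold the whole dataset. This is an illustrative sketch, not code from the original answer, and diffs across partition boundaries are lost and would need separate handling:

```python
def partition_diffs(it):
    """Yield pairwise differences between consecutive values of an
    iterator. Usable as rdd.mapPartitions(partition_diffs), which
    computes diffs within each partition; the pair straddling a
    partition boundary must be patched up separately."""
    prev = None
    for x in it:
        if prev is not None:
            yield x - prev
        prev = x

# Plain-Python usage; the same function works inside mapPartitions:
print(list(partition_diffs([20, 21, 22, 27, 37])))  # [1, 1, 5, 10]
```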
- How to transform data with sliding window over time series data in Pyspark
- How to access Spark RDD Array of elements based on index
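To see why a PARTITION BY clause changes the picture: with a partitioning key, each window covers only one group, so nothing forces all rows into a single task. The per-group semantics can be sketched in plain Python (the group keys and helper name below are illustrative, not part of the original answer):

```python
from itertools import groupby
from operator import itemgetter

def diffs_per_group(rows):
    """rows: (group_key, value) pairs, already sorted by key then value.
    Returns {key: [consecutive differences within that group]} --
    the semantics of lag over a PARTITION BY key ORDER BY value window."""
    out = {}
    for key, grp in groupby(rows, key=itemgetter(0)):
        vals = [v for _, v in grp]
        out[key] = [b - a for a, b in zip(vals, vals[1:])]
    return out

print(diffs_per_group([("dev1", 20), ("dev1", 21),
                       ("dev2", 30), ("dev2", 40)]))
# {'dev1': [1], 'dev2': [10]}
```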