Spark Window Functions - rangeBetween dates
Problem description
I have a Spark SQL DataFrame with data, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want all the rows from 7 days back preceding a given row. I figured out that I need to use a Window Function like:
Window \
.partitionBy('id') \
.orderBy('start')
and here comes the problem. I want to have a rangeBetween of 7 days, but there is nothing I could find on this in the Spark docs. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
If anyone could help me on this one I'll be very grateful. Thanks in advance!
As far as I know, it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is converting to a timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col  # col is used in the cast below
row = Row("id", "start", "some_value")
df = sc.parallelize([
row(1, "2015-01-01", 20.0),
row(1, "2015-01-06", 10.0),
row(1, "2015-01-07", 25.0),
row(1, "2015-01-12", 30.0),
row(2, "2015-01-01", 5.0),
row(2, "2015-01-03", 30.0),
row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
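As a quick sanity check (an added illustration, not part of the original answer), printSchema should confirm that start was cast to a date column; exact nullability flags can vary by Spark version:
df.printSchema()
## root
##  |-- id: long (nullable = true)
##  |-- start: date (nullable = true)
##  |-- some_value: double (nullable = true)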
A small helper:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# A Hive timestamp is interpreted as a UNIX timestamp in seconds
days = lambda i: i * 86400
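For illustration (an added sketch, not from the original answer): the orderBy column used below is the start date cast to epoch seconds, so a frame of rangeBetween(-days(7), 0) spans exactly one week on that scale:
# days(7) == 604800, i.e. one week expressed in seconds
df.select(
    col("start"),
    col("start").cast("timestamp").cast("long").alias("start_seconds")
).show()
# The exact start_seconds values depend on the session time zone.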
Finally, the query:
w = (Window()
.partitionBy(col("id"))
.orderBy(col("start").cast("timestamp").cast("long"))
.rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty, but it works.
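If the same pattern is needed more than once, one option (a sketch under the same assumptions, not from the original answer; the name sliding_mean is hypothetical) is to wrap the cast-and-range trick in a small helper:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, mean

def sliding_mean(value_col, order_col, partition_col, n_days):
    """Mean of value_col over the preceding n_days days, current row included."""
    seconds = n_days * 86400  # same day-to-seconds conversion as the days helper above
    w = (Window
         .partitionBy(col(partition_col))
         .orderBy(col(order_col).cast("timestamp").cast("long"))
         .rangeBetween(-seconds, 0))
    return mean(value_col).over(w)

# Usage, equivalent to the query above:
# df.select(col("*"), sliding_mean("some_value", "start", "id", 7).alias("mean")).show()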