星火窗口函数 - rangeBetween日期 [英] Spark Window Functions - rangeBetween dates

查看:2266
本文介绍了星火窗口函数 - rangeBetween日期的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个星火SQL 数据帧数据和我想要得到的是所有行$ P $在给定的日期范围内pceding当前行。例如,我想都从7给出天回preceding行的行。我想通了,我需要使用窗口功能这样的:

I am having a Spark SQL DataFrame with data and what I'm trying to get is all the rows preceding current row in a given date range. So for example I want to have all the rows from 7 days back preceding given row. I figured out I need to use a Window Function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

和来这里的问题。我想有一个 rangeBetween 7天,但并没有什么在Spark文档我能找到这一点。是否星火甚至提供这样的选择?现在我刚开始所有的preceding行:

and here comes the problem. I want to have a rangeBetween 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such option? For now I'm just getting all the preceding rows with:

.rowsBetween(-sys.maxsize, 0)

但想实现这样的:

but would like to achieve something like:

.rangeBetween("7 days", 0)

如果有人可以帮助我在这件,我会非常感激。在此先感谢!

If anyone could help me on this one I'll be very grateful. Thanks in advance!

推荐答案

据我知道这是不可能的直接无论是在星火也不蜂巢。这都需要 ORDER BY 子句范围曾经是数字。我发现的最接近的是转换为时间戳和秒工作。假设启动列包含日期键入

As far as I know it is not possible directly neither in Spark nor Hive. Both require ORDER BY clause used with RANGE to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming start column contains date type:

from pyspark.sql import Row

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))

一个小帮手和窗口定义:

A small helper and window definition:

from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col


# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400 

最后查询:

w = (Window()
   .partitionBy(col("id"))
   .orderBy(col("start").cast("timestamp").cast("long"))
   .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()

## +---+----------+----------+------------------+
## | id|     start|some_value|              mean|
## +---+----------+----------+------------------+
## |  1|2015-01-01|      20.0|              20.0|
## |  1|2015-01-06|      10.0|              15.0|
## |  1|2015-01-07|      25.0|18.333333333333332|
## |  1|2015-01-12|      30.0|21.666666666666668|
## |  2|2015-01-01|       5.0|               5.0|
## |  2|2015-01-03|      30.0|              17.5|
## |  2|2015-02-01|      20.0|              20.0|
## +---+----------+----------+------------------+

远pretty但作品。

Far from pretty but works.

* <一个href=\"https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp\"相对=nofollow>蜂巢语言手册,类型

这篇关于星火窗口函数 - rangeBetween日期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆