星火窗口函数 - rangeBetween日期 [英] Spark Window Functions - rangeBetween dates
问题描述
我有一个星火SQL 数据帧
数据和我想要得到的是所有行$ P $在给定的日期范围内pceding当前行。例如,我想都从7给出天回preceding行的行。我想通了,我需要使用窗口功能
这样的:
I am having a Spark SQL DataFrame
with data and what I'm trying to get is all the rows preceding current row in a given date range. So for example I want to have all the rows from 7 days back preceding given row. I figured out I need to use a Window Function
like:
Window \
.partitionBy('id') \
.orderBy('start')
和来这里的问题。我想有一个 rangeBetween
7天,但并没有什么在Spark文档我能找到这一点。是否星火甚至提供这样的选择?现在我刚开始所有的preceding行:
and here comes the problem. I want to have a rangeBetween
7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
但想实现这样的:
but would like to achieve something like:
.rangeBetween("7 days", 0)
如果有人可以帮助我在这件,我会非常感激。在此先感谢!
If anyone could help me on this one I'll be very grateful. Thanks in advance!
推荐答案
据我知道这是不可能的直接无论是在星火也不蜂巢。这都需要 ORDER BY
子句范围
曾经是数字。我发现的最接近的是转换为时间戳和秒工作。假设启动
列包含日期
键入
As far as I know it is not possible directly neither in Spark nor Hive. Both require ORDER BY
clause used with RANGE
to be numeric. The closest thing I found is conversion to timestamp and operating on seconds. Assuming start
column contains date
type:
from pyspark.sql import Row
row = Row("id", "start", "some_value")
df = sc.parallelize([
row(1, "2015-01-01", 20.0),
row(1, "2015-01-06", 10.0),
row(1, "2015-01-07", 25.0),
row(1, "2015-01-12", 30.0),
row(2, "2015-01-01", 5.0),
row(2, "2015-01-03", 30.0),
row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
一个小帮手和窗口定义:
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
最后查询:
w = (Window()
.partitionBy(col("id"))
.orderBy(col("start").cast("timestamp").cast("long"))
.rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
远pretty但作品。
Far from pretty but works.
* <一个href=\"https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp\"相对=nofollow>蜂巢语言手册,类型
这篇关于星火窗口函数 - rangeBetween日期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!