Spark Window Functions - rangeBetween dates
Problem description
I have a Spark SQL DataFrame with data, and what I'm trying to get is all the rows preceding the current row within a given date range. So, for example, I want all the rows from 7 days back preceding a given row. I figured out that I need to use a Window Function like:
Window \
.partitionBy('id') \
.orderBy('start')
and here comes the problem. I want to have a rangeBetween of 7 days, but there is nothing I could find on this in the Spark docs. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)
If anyone could help me on this one I'll be very grateful. Thanks in advance!
As far as I know, it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is converting to a timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col  # col is used in the cast below
row = Row("id", "start", "some_value")
df = sc.parallelize([
row(1, "2015-01-01", 20.0),
row(1, "2015-01-06", 10.0),
row(1, "2015-01-07", 25.0),
row(1, "2015-01-12", 30.0),
row(2, "2015-01-01", 5.0),
row(2, "2015-01-03", 30.0),
row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
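As a quick sanity check (an added illustration, not part of the original answer), printSchema should confirm that start was cast to a date column; exact nullability flags can vary by Spark version:
df.printSchema()
## root
##  |-- id: long (nullable = true)
##  |-- start: date (nullable = true)
##  |-- some_value: double (nullable = true)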
A small helper:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# A Hive timestamp is interpreted as a UNIX timestamp in seconds
days = lambda i: i * 86400
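For illustration (an added sketch, not from the original answer): the orderBy column used below is the start date cast to epoch seconds, so a frame of rangeBetween(-days(7), 0) spans exactly one week on that scale:
# days(7) == 604800, i.e. one week expressed in seconds
df.select(
    col("start"),
    col("start").cast("timestamp").cast("long").alias("start_seconds")
).show()
# The exact start_seconds values depend on the session time zone.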
Finally, the query:
w = (Window()
.partitionBy(col("id"))
.orderBy(col("start").cast("timestamp").cast("long"))
.rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty, but it works.
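If the same pattern is needed more than once, one option (a sketch under the same assumptions, not from the original answer; the name sliding_mean is hypothetical) is to wrap the cast-and-range trick in a small helper:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, mean

def sliding_mean(value_col, order_col, partition_col, n_days):
    """Mean of value_col over the preceding n_days days, current row included."""
    seconds = n_days * 86400  # same day-to-seconds conversion as the days helper above
    w = (Window
         .partitionBy(col(partition_col))
         .orderBy(col(order_col).cast("timestamp").cast("long"))
         .rangeBetween(-seconds, 0))
    return mean(value_col).over(w)

# Usage, equivalent to the query above:
# df.select(col("*"), sliding_mean("some_value", "start", "id", 7).alias("mean")).show()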