Spark Window Functions - rangeBetween dates


Question

I have a Spark SQL DataFrame with data, and I am trying to get all the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding a given row. I figured out I need to use a Window function like:

Window \
    .partitionBy('id') \
    .orderBy('start')

and here comes the problem. I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:

.rowsBetween(-sys.maxsize, 0)

but would like to achieve something like:

.rangeBetween("7 days", 0)

If anyone could help me with this one I'd be very grateful. Thanks in advance!

Solution

As far as I know it is not possible directly in either Spark or Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is converting to a timestamp and operating on seconds. Assuming the start column contains a date type:

from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
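
As a quick sanity check, the resulting schema should look like this (output shown in the same ## style, assuming the snippet above is run as-is):

df.printSchema()
## root
##  |-- id: long (nullable = true)
##  |-- start: date (nullable = true)
##  |-- some_value: double (nullable = true)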

A small helper and window definition:

from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col


# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400 
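
The helper only converts a day count to seconds, so a frame boundary of -days(7) simply means 7 * 86400 = 604800 seconds back from the current row's numeric start value:

days(7)
## 604800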

Finally, the query:

w = (Window()
   .partitionBy(col("id"))
   .orderBy(col("start").cast("timestamp").cast("long"))
   .rangeBetween(-days(7), 0))

df.select(col("*"), mean("some_value").over(w).alias("mean")).show()

## +---+----------+----------+------------------+
## | id|     start|some_value|              mean|
## +---+----------+----------+------------------+
## |  1|2015-01-01|      20.0|              20.0|
## |  1|2015-01-06|      10.0|              15.0|
## |  1|2015-01-07|      25.0|18.333333333333332|
## |  1|2015-01-12|      30.0|21.666666666666668|
## |  2|2015-01-01|       5.0|               5.0|
## |  2|2015-01-03|      30.0|              17.5|
## |  2|2015-02-01|      20.0|              20.0|
## +---+----------+----------+------------------+

Far from pretty, but it works: for id 1 and 2015-01-07, for example, the frame covers 2015-01-01, 2015-01-06 and 2015-01-07, so the mean is (20.0 + 10.0 + 25.0) / 3 ≈ 18.33, matching the output above.
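
If this pattern is needed more than once, it can be wrapped in a small helper; day_range_window below is just an illustrative name, not an existing API:

def day_range_window(partition_col, order_col, n_days):
    # Frame covering the last n_days * 86400 seconds up to the current row,
    # using the same timestamp-to-seconds trick as above.
    return (Window()
        .partitionBy(col(partition_col))
        .orderBy(col(order_col).cast("timestamp").cast("long"))
        .rangeBetween(-days(n_days), 0))

df.select(col("*"), mean("some_value").over(day_range_window("id", "start", 7)).alias("mean")).show()

This produces the same table as above.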


* Hive Language Manual, Types
