SparkR window function

Problem description

I found from JIRA that the 1.6 release of SparkR has implemented window functions including lag and rank, but the over function is not implemented yet. How can I use a window function such as lag without over in SparkR (not the SparkSQL way)? Can someone provide an example?

Recommended answer

Unfortunately, it is not possible in 1.6.0. While some window functions, including lag, have been implemented, SparkR doesn't support window definitions yet, which renders them effectively useless.

As long as SPARK-11395 is not resolved, the only option is to use raw SQL:

library(magrittr)  # provides the %>% pipe used below; assumes a SparkR shell where `sc` is available

set.seed(1)

# Create a HiveContext and register a small example DataFrame as a temp table
hc <- sparkRHive.init(sc)
sdf <- createDataFrame(hc, data.frame(x = 1:12, y = 1:3, z = rnorm(12)))
registerTempTable(sdf, "sdf")

# LAG over a window partitioned by y and ordered by x, expressed in raw SQL
sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") %>%
  head()

##    x y          z        _c3
## 1  1 1 -0.6264538         NA
## 2  4 1  1.5952808 -0.6264538
## 3  7 1  0.4874291  1.5952808
## 4 10 1 -0.3053884  0.4874291
## 5  2 2  0.1836433         NA
## 6  5 2  0.3295078  0.1836433
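If you want a readable column name instead of the auto-generated _c3 above, you can alias the expression directly in the SQL. A minimal variation of the same query, assuming the same sdf temp table (the z_lag name is just an example):

sql(hc, "SELECT x, y, z,
                LAG(z) OVER (PARTITION BY y ORDER BY x) AS z_lag
         FROM sdf") %>%
  head()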

Assuming that the corresponding PR is merged without significant changes, the window definition and example query should look as follows:

w <- Window.partitionBy("y") %>% orderBy("x")
select(sdf, over(lag(sdf$z), w))
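For reference, after SPARK-11395 was resolved, later SparkR releases (2.0+) expose window definitions through windowPartitionBy, orderBy, and over. A rough sketch of the equivalent query under that API, which is not available in 1.6.0; the z_lag alias is chosen here purely for illustration:

# Sketch only: assumes the WindowSpec API of SparkR 2.0+ and the same sdf as above
ws  <- orderBy(windowPartitionBy("y"), "x")   # PARTITION BY y ORDER BY x
res <- select(sdf, sdf$x, sdf$y, sdf$z,
              alias(over(lag(sdf$z, 1), ws), "z_lag"))
head(res)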
