SparkR window function
Question
I found from JIRA that the 1.6 release of SparkR has implemented window functions, including lag and rank, but the over function is not implemented yet. How can I use a window function such as lag without over in SparkR (not the SparkSQL way)? Can someone provide an example?
Answer
Unfortunately, it is not possible in 1.6.0. While some window functions, including lag, have been implemented, SparkR doesn't support window definitions yet, which renders these completely useless.
As long as SPARK-11395 is not resolved, the only option is to use raw SQL:
set.seed(1)

# The %>% pipe comes from magrittr, not SparkR itself
library(magrittr)

hc <- sparkRHive.init(sc)
sdf <- createDataFrame(hc, data.frame(x = 1:12, y = 1:3, z = rnorm(12)))
registerTempTable(sdf, "sdf")

sql(hc, "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) FROM sdf") %>%
  head()
## x y z _c3
## 1 1 1 -0.6264538 NA
## 2 4 1 1.5952808 -0.6264538
## 3 7 1 0.4874291 1.5952808
## 4 10 1 -0.3053884 0.4874291
## 5 2 2 0.1836433 NA
## 6 5 2 0.3295078 0.1836433
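The semantics of LAG(z) OVER (PARTITION BY y ORDER BY x) can also be checked outside Spark. Below is a minimal sketch using Python's built-in sqlite3 module (window functions require SQLite 3.25+); the table only mirrors the shape of the example above, with made-up z values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sdf (x INTEGER, y INTEGER, z REAL)")

# Rows mirror the example's shape: x = 1..12, y cycles over 1, 2, 3;
# z is synthetic (x / 10) rather than the rnorm() values above.
rows = [(x, (x - 1) % 3 + 1, x / 10) for x in range(1, 13)]
conn.executemany("INSERT INTO sdf VALUES (?, ?, ?)", rows)

# LAG(z) looks one row back within each y-partition, ordered by x;
# the first row of each partition has no predecessor, so LAG yields NULL.
result = conn.execute(
    "SELECT x, y, z, LAG(z) OVER (PARTITION BY y ORDER BY x) AS prev_z "
    "FROM sdf ORDER BY y, x"
).fetchall()

for row in result[:4]:
    print(row)
```

As in the SparkR output, the first row of every partition gets NULL (None), and each subsequent row sees the previous row's z.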
Assuming that the corresponding PR is merged without significant changes, the window definition and an example query should look as follows:
w <- Window.partitionBy("y") %>% orderBy("x")
select(sdf, over(lag(sdf$z), w))