Spark and SparkSQL: How to imitate window function?
Question
Given a DataFrame df:
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
- grouped by the same id, and
- sorted by date within that group,
so that:
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with a window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
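For reference, the error message points at the workaround available in 1.4.x itself: window functions do work if the DataFrame is created through a HiveContext. A minimal sketch, assuming an existing SparkContext named sc and the 1.4-era rowNumber API (renamed row_number in later releases):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import org.apache.spark.sql.hive.HiveContext

// In Spark 1.4.x, window functions require a HiveContext.
val hiveContext = new HiveContext(sc)
val df = hiveContext.createDataFrame(Seq(
  (1, "2015-09-01"), (2, "2015-09-01"),
  (1, "2015-09-03"), (1, "2015-09-04"), (2, "2015-09-04")
)).toDF("id", "date")

// Partition by id, order by date, and number the rows within each partition.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select(df("id"), df("date"), rowNumber().over(w).as("counter"))
```

This requires Spark to be built with Hive support, which is why the RDD-based answer below avoids it.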
Questions
- How can I achieve the above computation on the current Spark 1.4.1 without using window functions?
- When will window functions for regular dataframes be supported in Spark?
Thanks!
Answer
You can do this with RDDs. Personally, I find the RDD API makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group' that won't be a great idea.
The downside to using RDDs like this is that it's tedious to convert from a DataFrame to an RDD and back again.
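To illustrate that round trip, the zipWithIndex step can be flattened with flatMap and converted back to a DataFrame in one pass. A sketch under the same Spark 1.4-era API, assuming the two-column int/string schema from the unionAll example and column names chosen here for illustration:

```scala
import sqlContext.implicits._

// Flatten each sorted group back to one record per input row,
// turning the zero-based index into the desired 1-based counter.
val resultDF = df.rdd
  .groupBy(r => r(0))
  .flatMap { case (_, rows) =>
    rows.toList.sortBy(row => row(1).toString).zipWithIndex.map {
      case (row, idx) => (row.getInt(0), row.getString(1), idx + 1)
    }
  }
  .toDF("id", "date", "counter")
```

Note that the order of groups across partitions is not guaranteed; sort the resulting DataFrame if a deterministic order matters.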