Spark and SparkSQL: How to imitate window function?
Question
Given a DataFrame df:
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
- grouped by the same id, and
- sorted by date within that group,
so that:
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with a window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
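For reference, the error message points at the workaround available in 1.4.x itself: window functions do work if the DataFrame is created through a HiveContext. A minimal sketch, assuming an existing SparkContext named sc and the 1.4-era rowNumber API (renamed row_number in later releases):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import org.apache.spark.sql.hive.HiveContext

// In Spark 1.4.x, window functions require a HiveContext.
val hiveContext = new HiveContext(sc)
val df = hiveContext.createDataFrame(Seq(
  (1, "2015-09-01"), (2, "2015-09-01"),
  (1, "2015-09-03"), (1, "2015-09-04"), (2, "2015-09-04")
)).toDF("id", "date")

// Partition by id, order by date, and number the rows within each partition.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select(df("id"), df("date"), rowNumber().over(w).as("counter"))
```

This requires Spark to be built with Hive support, which is why the RDD-based answer below avoids it.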
Questions
- How can I achieve the above computation on the current Spark 1.4.1 without using window functions?
- When will window functions for regular dataframes be supported in Spark?
Thanks!
Answer
You can do this with RDDs. Personally, I find the RDD API makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group' that won't be a great idea.
The downside to using RDDs like this is that it's tedious to convert from a DataFrame to an RDD and back again.
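To illustrate that round trip, the zipWithIndex step can be flattened with flatMap and converted back to a DataFrame in one pass. A sketch under the same Spark 1.4-era API, assuming the two-column int/string schema from the unionAll example and column names chosen here for illustration:

```scala
import sqlContext.implicits._

// Flatten each sorted group back to one record per input row,
// turning the zero-based index into the desired 1-based counter.
val resultDF = df.rdd
  .groupBy(r => r(0))
  .flatMap { case (_, rows) =>
    rows.toList.sortBy(row => row(1).toString).zipWithIndex.map {
      case (row, idx) => (row.getInt(0), row.getString(1), idx + 1)
    }
  }
  .toDF("id", "date", "counter")
```

Note that the order of groups across partitions is not guaranteed; sort the resulting DataFrame if a deterministic order matters.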