Get the last element of a window in Spark 2.1.1


Problem description

I have a dataframe containing subcategories, and I want the last element of each of these subcategories.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, last}

val windowSpec = Window.partitionBy("name").orderBy("count")
sqlContext
    .createDataFrame(
      Seq[(String, Int)](
        ("A", 1),
        ("A", 2),
        ("A", 3),
        ("B", 10),
        ("B", 20),
        ("B", 30)
      ))
    .toDF("name", "count")
    .withColumn("firstCountOfName", first("count").over(windowSpec))
    .withColumn("lastCountOfName", last("count").over(windowSpec))
    .show()

returns something strange:

+----+-----+----------------+---------------+
|name|count|firstCountOfName|lastCountOfName|
+----+-----+----------------+---------------+
|   B|   10|              10|             10|
|   B|   20|              10|             20|
|   B|   30|              10|             30|
|   A|    1|               1|              1|
|   A|    2|               1|              2|
|   A|    3|               1|              3|
+----+-----+----------------+---------------+

As we can see, the first value is computed correctly, but the last is not: it is always the current value of the column.

Does anyone have a solution for what I am trying to do?

Recommended answer

According to the issue SPARK-20969, you should be able to get the expected results by defining adequate bounds for your window, as shown below. By default, a window with an orderBy clause uses the frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is why last returns the current row's value.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window
  .partitionBy("name")
  .orderBy("count")
  // frame covers the entire partition, so last() can see rows after the current one
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

sqlContext
  .createDataFrame(
    Seq[(String, Int)](
      ("A", 1),
      ("A", 2),
      ("A", 3),
      ("B", 10),
      ("B", 20),
      ("B", 30)
    ))
  .toDF("name", "count")
  .withColumn("firstCountOfName", first("count").over(windowSpec))
  .withColumn("lastCountOfName", last("count").over(windowSpec))
  .show()
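
With the frame spanning the entire partition, the output should now show, for every row, the true first and last count of its name (row order may differ):

+----+-----+----------------+---------------+
|name|count|firstCountOfName|lastCountOfName|
+----+-----+----------------+---------------+
|   B|   10|              10|             30|
|   B|   20|              10|             30|
|   B|   30|              10|             30|
|   A|    1|               1|              3|
|   A|    2|               1|              3|
|   A|    3|               1|              3|
+----+-----+----------------+---------------+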

Alternatively, if you are ordering on the same column you are computing first and last on, you can switch to min and max with a non-ordered window, and it should also work properly. A sketch of this alternative follows.
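
Here is a minimal sketch of that alternative, assuming df is the same name/count DataFrame built above (the name df is not in the original answer):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, min}

// No orderBy: the default frame is the whole partition,
// so min and max see every row of each name.
val unorderedWindow = Window.partitionBy("name")

// df is assumed to be the name/count DataFrame from the question.
df
  .withColumn("firstCountOfName", min("count").over(unorderedWindow))
  .withColumn("lastCountOfName", max("count").over(unorderedWindow))
  .show()

This works here only because the ordering column and the aggregated column are the same: the smallest count is the first row of the ordered window and the largest count is the last.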
