PySpark/Spark 窗口函数第一/最后一期 [英] PySpark / Spark Window Function First/ Last Issue

查看:30
本文介绍了PySpark/Spark 窗口函数第一/最后一期的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

根据我的理解,Spark 中的 first/last 函数将检索每个分区的第一行/最后一行/我无法理解为什么 LAST 函数给出的结果不正确.

From my understanding first/ last function in Spark will retrieve first / last row of each partition/ I am not able to understand why LAST function is giving incorrect results.

这是我的代码.

AgeWindow = Window.partitionBy('Dept').orderBy('Age')
df1 = df1.withColumn('first(ID)', first('ID').over(AgeWindow))\
        .withColumn('last(ID)', last('ID').over(AgeWindow))           
df1.show()

+---+----------+---+--------+--------------------------+-------------------------+
|Age|      Dept| ID|    Name|first(ID)                 |last(ID)                |
+---+----------+---+--------+--------------------------+-------------------------+
| 38|  medicine|  4|   harry|                         4|                        4|
| 41|  medicine|  5|hermione|                         4|                        5|
| 55|  medicine|  7| gandalf|                         4|                        7|
| 15|technology|  6|  sirius|                         6|                        6|
| 49|technology|  9|     sam|                         6|                        9|
| 88|technology|  1|     sam|                         6|                        2|
| 88|technology|  2|     nik|                         6|                        2|
| 75|       mba|  8|   ginny|                         8|                       11|
| 75|       mba| 10|     sam|                         8|                       11|
| 75|       mba|  3|     ron|                         8|                       11|
| 75|       mba| 11|     ron|                         8|                       11|
+---+----------+---+--------+--------------------------+-------------------------+

推荐答案

没有错.您的窗口定义与您认为的不同.

It is not incorrect. Your window definition is just not what you think it is.

如果您提供 ORDER BY 子句,则默认框架为 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:

If you provide ORDER BY clause then the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:

from pyspark.sql.window import Window
from pyspark.sql.functions import first, last

w = Window.partitionBy('Dept').orderBy('Age')

df = spark.createDataFrame(
    [(38, "medicine", 4), (41, "medicine", 5), (55, "medicine", 7)],
    ("Age", "Dept", "ID")
)

df.select(
    "*",
    first('ID').over(w).alias("first_id"), 
    last('ID').over(w).alias("last_id")
).explain()

== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS first_id#38L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS last_id#40L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]

这意味着窗口函数永远不会向前看,框架中的最后一行是当前行.

This means that the window function never looks ahead and the last row in the frame is the current row.

您应该将窗口重新定义为

You should redefine the window as

w_uf = (Window
   .partitionBy('Dept')
   .orderBy('Age')
   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = df.select(
    "*", 
    first('ID').over(w_uf).alias("first_id"),
    last('ID').over(w_uf).alias("last_id")
)

== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS first_id#56L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS last_id#58L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]

result.show()

+---+--------+---+--------+-------+
|Age|    Dept| ID|first_id|last_id|
+---+--------+---+--------+-------+
| 38|medicine|  4|       4|      7|
| 41|medicine|  5|       4|      7|
| 55|medicine|  7|       4|      7|
+---+--------+---+--------+-------+

这篇关于PySpark/Spark 窗口函数第一/最后一期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆