Spark Dataframe access of previous calculated row


Problem description

I have the following data:

+-----+-----+----+
|Col1 |t0   |t1  |
+-----+-----+----+
| A   |null |20  |
| A   |20   |40  |
| B   |null |10  |
| B   |10   |20  |
| B   |20   |120 |
| B   |120  |140 |
| B   |140  |320 |
| B   |320  |340 |
| B   |340  |360 |
+-----+-----+----+

What I want is this:

+-----+-----+----+----+
|Col1 |t0   |t1  |grp |
+-----+-----+----+----+
| A   |null |20  |1A  |
| A   |20   |40  |1A  |
| B   |null |10  |1B  |
| B   |10   |20  |1B  |
| B   |20   |120 |2B  |
| B   |120  |140 |2B  |
| B   |140  |320 |3B  |
| B   |320  |340 |3B  |
| B   |340  |360 |3B  |
+-----+-----+----+----+

Explanation: the extra column is based on Col1 and the difference between t1 and t0. When the difference between the two is too high, a new number is generated (in the dataset above, when the difference is greater than 50).

I build t0 with:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val windowSpec = Window.partitionBy($"Col1").orderBy("t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
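To see what `lag("t1", 1)` over this window produces, here is a minimal plain-Python sketch (not Spark; it assumes the rows are already ordered by t1 within each Col1 partition, as the windowSpec specifies):

```python
from itertools import groupby

rows = [("A", 20), ("A", 40), ("B", 10), ("B", 20), ("B", 120)]

def add_t0(rows):
    """Within each Col1 partition (ordered by t1), t0 is the previous
    row's t1, or None for the partition's first row -- the same result
    as lag("t1", 1).over(windowSpec)."""
    out = []
    for col1, part in groupby(rows, key=lambda r: r[0]):
        prev = None
        for _, t1 in part:
            out.append((col1, prev, t1))
            prev = t1
    return out

for row in add_t0(rows):
    print(row)
```

Note that `prev` is reset to `None` at the start of each partition, mirroring how the window restarts per `Col1` value.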

Can someone help me with how to do it? I searched but didn't find a good approach. I'm a little lost because I need the value of the previously calculated row of grp...

Thanks

Accepted answer

I solved it myself:

import org.apache.spark.sql.functions.{coalesce, lag, lit, sum}

// flag is 1 when the gap to the previous t exceeds 50, else 0
val grp = (coalesce(
      ($"t" - lag($"t", 1).over(windowSpec)),
      lit(0)
    ) > 50).cast("bigint")

// a running sum of the flags gives the group number per partition
df = df.withColumn("grp", sum(grp).over(windowSpec))
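The flag-and-running-sum idea can be sketched in plain Python (not Spark) to see why it produces the expected grp values. The threshold of 50 and the column names follow the question; grp starts at 1 here to match the desired output (the Spark running sum starts at 0, so you would add 1 if you need 1-based numbers):

```python
from itertools import groupby

rows = [
    ("A", 20), ("A", 40),
    ("B", 10), ("B", 20), ("B", 120), ("B", 140),
    ("B", 320), ("B", 340), ("B", 360),
]

def add_grp(rows, threshold=50):
    """Per partition (Col1), raise a flag whenever the gap to the
    previous t1 exceeds the threshold, and keep a running sum of the
    flags -- the same logic as sum(flag).over(windowSpec)."""
    out = []
    for col1, part in groupby(rows, key=lambda r: r[0]):
        grp, prev = 1, None          # counter resets per partition
        for _, t1 in part:
            if prev is not None and t1 - prev > threshold:
                grp += 1             # flag = 1: gap too large, new group
            out.append((col1, t1, f"{grp}{col1}"))
            prev = t1
    return out

for row in add_grp(rows):
    print(row)
```

Appending `Col1` to the group number, as in the desired output, is just the string concatenation in the f-string above.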

With this I don't need both columns (t0 and t1) anymore; I can use only t1 (or t) without computing t0.

(I only need to add the value of Col1, but the most important part, the numbering, is done and works fine.)

I got the solution from: Spark SQL window function with complex condition

Thanks for your help.

