Difference between two rows in Spark dataframe


Problem description

I created a dataframe in Spark by grouping on column1 and date and summing the amount.

val table = df1.groupBy($"column1",$"date").sum("amount")

Column1 |Date   |Amount
A       |1-jul  |1000
A       |1-june |2000
A       |1-May  |2000
A       |1-dec  |3000
A       |1-Nov  |2000
B       |1-jul  |100
B       |1-june |300
B       |1-May  |400
B       |1-dec  |300
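
For context, a minimal, self-contained sketch of the setup this snippet assumes, run in spark-shell (or any SparkSession named spark). The sample rows are made up for illustration; only the column names come from the question's groupBy call.

import spark.implicits._

// Hypothetical sample input; the rows are illustrative only
val df1 = Seq(
  ("A", "1-jul", 400), ("A", "1-jul", 600),   // two rows that sum to 1000
  ("A", "1-june", 2000),
  ("B", "1-jul", 100)
).toDF("column1", "date", "amount")

// One row per (column1, date) pair, with the total in a column named sum(amount)
val table = df1.groupBy($"column1", $"date").sum("amount")
table.show(false)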

Now I want to add a new column with the difference between the amounts of any two dates from the table.

Recommended answer

You can use a Window function if the calculation is fixed, for example computing the difference from the previous month, or from two months back, and so on. For that you can use the lag and lead functions with a Window.

But for that you need to convert the date column as below so that it can be ordered.

+-------+------+--------------+------+
|Column1|Date  |Date_Converted|Amount|
+-------+------+--------------+------+
|A      |1-jul |2017-07-01    |1000  |
|A      |1-june|2017-06-01    |2000  |
|A      |1-May |2017-05-01    |2000  |
|A      |1-dec |2017-12-01    |3000  |
|A      |1-Nov |2017-11-01    |2000  |
|B      |1-jul |2017-07-01    |100   |
|B      |1-june|2017-06-01    |300   |
|B      |1-May |2017-05-01    |400   |
|B      |1-dec |2017-12-01    |300   |
+-------+------+--------------+------+
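
The converted column itself is not shown being built in the answer. One possible way to derive Date_Converted, assuming the year 2017 used in the table above; the month map, the toIsoDate helper, and the column rename are illustrative, not part of the original answer.

import org.apache.spark.sql.functions.udf

// Illustrative lookup from month-name fragments to month numbers
val monthMap = Map(
  "jan" -> "01", "feb" -> "02", "mar" -> "03", "apr" -> "04",
  "may" -> "05", "june" -> "06", "jul" -> "07", "aug" -> "08",
  "sep" -> "09", "oct" -> "10", "nov" -> "11", "dec" -> "12")

// "1-jul" -> "2017-07-01" (the year 2017 is an assumption taken from the table)
val toIsoDate = udf { raw: String =>
  val Array(day, month) = raw.split("-")
  f"2017-${monthMap(month.toLowerCase)}-${day.toInt}%02d"
}

val df = table
  .withColumnRenamed("sum(amount)", "Amount")        // match the Amount column shown above
  .withColumn("Date_Converted", toIsoDate($"date"))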

You can do that by running

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

// One window per Column1, ordered chronologically by the converted date
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")

// Difference with the previous month; a null lag (first row of each partition) counts as 0
df.withColumn("diff_Amt_With_Prev_Month",
    $"Amount" - when(lag("Amount", 1).over(windowSpec).isNull, 0).otherwise(lag("Amount", 1).over(windowSpec)))
  .show(false)

You should have

+-------+------+--------------+------+------------------------+
|Column1|Date  |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B      |1-May |2017-05-01    |400   |400.0                   |
|B      |1-june|2017-06-01    |300   |-100.0                  |
|B      |1-jul |2017-07-01    |100   |-200.0                  |
|B      |1-dec |2017-12-01    |300   |200.0                   |
|A      |1-May |2017-05-01    |2000  |2000.0                  |
|A      |1-june|2017-06-01    |2000  |0.0                     |
|A      |1-jul |2017-07-01    |1000  |-1000.0                 |
|A      |1-Nov |2017-11-01    |2000  |1000.0                  |
|A      |1-dec |2017-12-01    |3000  |1000.0                  |
+-------+------+--------------+------+------------------------+
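
The answer also names lead. As a sketch using the same windowSpec, the symmetric calculation against the following month would look like this; the output column name is illustrative.

// Same pattern with lead: difference from the following month's amount
df.withColumn("diff_Amt_With_Next_Month",
    $"Amount" - when(lead("Amount", 1).over(windowSpec).isNull, 0).otherwise(lead("Amount", 1).over(windowSpec)))
  .show(false)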

You can increase the lag offset to compare with the amount from two months back:

df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
  .show(false)

Which will give you

+-------+------+--------------+------+----------------------------+
|Column1|Date  |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B      |1-May |2017-05-01    |400   |400.0                       |
|B      |1-june|2017-06-01    |300   |300.0                       |
|B      |1-jul |2017-07-01    |100   |-300.0                      |
|B      |1-dec |2017-12-01    |300   |0.0                         |
|A      |1-May |2017-05-01    |2000  |2000.0                      |
|A      |1-june|2017-06-01    |2000  |2000.0                      |
|A      |1-jul |2017-07-01    |1000  |-1000.0                     |
|A      |1-Nov |2017-11-01    |2000  |0.0                         |
|A      |1-dec |2017-12-01    |3000  |2000.0                      |
+-------+------+--------------+------+----------------------------+
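
As a side note on the when/otherwise guard: the same null handling can be written with coalesce, which is equivalent but a little shorter. This rewrite is not part of the original answer.

// Equivalent to the when(...isNull, 0).otherwise(...) guard above
df.withColumn("diff_Amt_With_Prev_Month",
    $"Amount" - coalesce(lag("Amount", 1).over(windowSpec), lit(0)))
  .show(false)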

I hope the answer is helpful.
