Spark数据帧中两行之间的差异 [英] Difference between two rows in Spark dataframe
问题描述
我通过groupby列1和日期在Spark中创建了一个数据框,并计算了金额.
I created a dataframe in Spark, by groupby column1 and date and calculated the amount.
val table = df1.groupBy($"column1",$"date").sum("amount")
Column1 |Date |Amount
A |1-jul |1000
A |1-june |2000
A |1-May |2000
A |1-dec |3000
A |1-Nov |2000
B |1-jul |100
B |1-june |300
B |1-May |400
B |1-dec |300
现在,我想添加一个新列,其中表中任意两个日期的数量之间存在差异.
Now, I want to add new column, with difference between amount of any two dates from the table.
推荐答案
如果计算固定为计算前两个月之间的差额,或者计算前两个月之间的,则可以使用Window
函数两个月 ...等.为此,您可以将lag
和lead
函数与Window
一起使用.
You can use Window
function if the calculation is fixed as calculating difference between previous months, or calculating between previous two months ... etc. For that you can use lag
and lead
function with Window
.
但是为此,您需要更改日期列,如下所示,以便可以对其进行订购.
But for that you need to change the date column as below so that it can be ordered.
+-------+------+--------------+------+
|Column1|Date |Date_Converted|Amount|
+-------+------+--------------+------+
|A |1-jul |2017-07-01 |1000 |
|A |1-june|2017-06-01 |2000 |
|A |1-May |2017-05-01 |2000 |
|A |1-dec |2017-12-01 |3000 |
|A |1-Nov |2017-11-01 |2000 |
|B |1-jul |2017-07-01 |100 |
|B |1-june|2017-06-01 |300 |
|B |1-May |2017-05-01 |400 |
|B |1-dec |2017-12-01 |300 |
+-------+------+--------------+------+
您可以通过执行
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")
import org.apache.spark.sql.functions._
df.withColumn("diff_Amt_With_Prev_Month", $"Amount" - when((lag("Amount", 1).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 1).over(windowSpec)))
.show(false)
您应该拥有
+-------+------+--------------+------+------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |-100.0 |
|B |1-jul |2017-07-01 |100 |-200.0 |
|B |1-dec |2017-12-01 |300 |200.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |0.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |1000.0 |
|A |1-dec |2017-12-01 |3000 |1000.0 |
+-------+------+--------------+------+------------------------+
您可以将前两个月的滞后排名提高为
You can increase the lagging position for previous two months as
df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
.show(false)
这将为您提供
+-------+------+--------------+------+----------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |300.0 |
|B |1-jul |2017-07-01 |100 |-300.0 |
|B |1-dec |2017-12-01 |300 |0.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |2000.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |0.0 |
|A |1-dec |2017-12-01 |3000 |2000.0 |
+-------+------+--------------+------+----------------------------+
我希望答案会有所帮助
这篇关于Spark数据帧中两行之间的差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!