How to get data of previous row in Apache Spark
Question
Find the previous month's sale for each city from a Spark DataFrame:
+----+--------+----+
|City|   Month|Sale|
+----+--------+----+
|  c1|JAN-2017|  49|
|  c1|FEB-2017|  46|
|  c1|MAR-2017|  83|
|  c2|JAN-2017|  59|
|  c2|MAY-2017|  60|
|  c2|JUN-2017|  49|
|  c2|JUL-2017|  73|
+----+--------+----+
The desired output is:
+----+--------+----+-------------+
|City|   Month|Sale|previous_sale|
+----+--------+----+-------------+
|  c1|JAN-2017|  49|         NULL|
|  c1|FEB-2017|  46|           49|
|  c1|MAR-2017|  83|           46|
|  c2|JAN-2017|  59|         NULL|
|  c2|MAY-2017|  60|           59|
|  c2|JUN-2017|  49|           60|
|  c2|JUL-2017|  73|           49|
+----+--------+----+-------------+
Please help me.
Answer
You can use the lag window function to get the previous value.
If you want to sort by month, you need to convert the month string to a proper date, e.g. "JAN-2017" to "01-01-2017". Something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq(
  ("c1", "JAN-2017", 49),
  ("c1", "FEB-2017", 46),
  ("c1", "MAR-2017", 83),
  ("c2", "JAN-2017", 59),
  ("c2", "MAY-2017", 60),
  ("c2", "JUN-2017", 49),
  ("c2", "JUL-2017", 73)
)).toDF("city", "month", "sales")

// Partition by city; ordering by the raw month string sorts alphabetically
val window = Window.partitionBy("city").orderBy("month")

df.withColumn("previous_sale", lag($"sales", 1, null).over(window)).show()
Output:
+----+--------+-----+-------------+
|city|   month|sales|previous_sale|
+----+--------+-----+-------------+
|  c1|FEB-2017|   46|         null|
|  c1|JAN-2017|   49|           46|
|  c1|MAR-2017|   83|           49|
|  c2|JAN-2017|   59|         null|
|  c2|JUL-2017|   73|           59|
|  c2|JUN-2017|   49|           73|
|  c2|MAY-2017|   60|           49|
+----+--------+-----+-------------+
Note that the rows are sorted alphabetically by the month string, so the previous_sale values do not yet follow calendar order.
You can use this UDF to create a default date like 01/month/year, which can then be used to sort by date even across different years:
import java.sql.Date
import java.time.LocalDate

// Parse a "JAN-2017"-style string into the first day of that month
val fullDate = udf((value: String) => {
  val months = List("JAN", "FEB", "MAR", "APR", "MAY", "JUN", "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
  val split = value.split("-")
  Date.valueOf(LocalDate.of(split(1).toInt, months.indexOf(split(0)) + 1, 1))
})

df.withColumn("month", fullDate($"month")).show()
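Putting the two pieces together, here is a sketch (assuming the same df, fullDate, and imports as above; the month_date column name is just an illustrative choice) that orders the window by the converted date, so lag picks the chronologically previous month rather than the alphabetically previous one:

```scala
// Keep the original month string for display and add a sortable date column
val withDate = df.withColumn("month_date", fullDate($"month"))

// Same partitioning as before, but ordered by the real date
val chronoWindow = Window.partitionBy("city").orderBy("month_date")

withDate
  .withColumn("previous_sale", lag($"sales", 1).over(chronoWindow))
  .orderBy($"city", $"month_date")
  .drop("month_date")
  .show()
```

This should produce the desired output from the question, with each city's rows in calendar order and previous_sale taken from the row before.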
Hope this helps!