Spark: add a new column to a DataFrame with the value from the previous row
Question
I'm wondering how I can achieve the following in Spark (PySpark).
Initial DataFrame:
+--+---+
|id|num|
+--+---+
|4 |9.0|
+--+---+
|3 |7.0|
+--+---+
|2 |3.0|
+--+---+
|1 |5.0|
+--+---+
Resulting DataFrame:
+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0| 7.0 |
+--+---+-------+
|3 |7.0| 3.0 |
+--+---+-------+
|2 |3.0| 5.0 |
+--+---+-------+
I have managed to append a new column to a DataFrame in general by using something like: df.withColumn("new_Col", df.num * 10)
However, I have no idea how to achieve this "shift of rows" for the new column, so that the new column holds the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation on how to access a certain row in a DataFrame by index.
Any help would be appreciated.
Recommended answer
You can use the lag window function as follows:
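from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window
# `sc` is assumed to be an existing SparkContext (as in the PySpark shell).
df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
# A window with no partitioning: all rows form one group, ordered by id.
w = Window.orderBy(col("id"))
# lag("num") yields the previous row's num; the first row gets null and is dropped.
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()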
## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## | 2|3.0| 5.0|
## | 3|7.0| 3.0|
## | 4|9.0| 7.0|
## +---+---+-------+
But there are some important issues:
- If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient.
- You need a natural way to order your data.
While the second issue is almost never a problem, the first one can be a deal-breaker. If this is the case, you should simply convert your DataFrame to an RDD and compute lag manually. See for example:
- How to transform data with sliding window over time series data in Pyspark
- Apache Spark Moving Average (written in Scala, but can be adjusted for PySpark. Be sure to read the comments first).
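To make the idea of computing lag manually a bit more concrete, here is a minimal plain-Python sketch of one way it could work: sort, index, shift the index by one, and join. It is not taken from the linked answers and does not use Spark. The function name lag_by_index is hypothetical, and an RDD version would presumably map these steps onto sortBy, zipWithIndex, and a join.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[int, float]  # (id, num), mirroring the example DataFrame


def lag_by_index(rows: Iterable[Row]) -> List[Tuple[int, float, Optional[float]]]:
    """Attach the previous row's num (ordered by id) to each row.

    Hypothetical helper for illustration only; not part of any Spark API.
    """
    # Step 1: sort by the ordering key and number the rows 0..n-1
    # (roughly the role zipWithIndex would play on a sorted RDD).
    indexed = list(enumerate(sorted(rows, key=lambda r: r[0])))

    # Step 2: key a copy of num by index + 1, so that the value of
    # row i lines up with row i + 1.
    shifted = {i + 1: num for i, (_id, num) in indexed}

    # Step 3: look up each row's index in the shifted copy (a left join);
    # the first row has no predecessor and gets None.
    return [(id_, num, shifted.get(i)) for i, (id_, num) in indexed]


rows = [(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]
# Drop the row without a predecessor, similar to na.drop() above.
result = [r for r in lag_by_index(rows) if r[2] is not None]
for r in result:
    print(r)
# (2, 3.0, 5.0)
# (3, 7.0, 3.0)
# (4, 9.0, 7.0)
```

The result matches the output of the lag example above. Whether this pattern actually performs better than a window function on a real cluster depends on the data and should be measured.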