PySpark: forward fill with last observation for a DataFrame


Problem description

Using Spark 1.5.1, I've been trying to forward fill null values with the last known observation for one column of my DataFrame.

It is possible for a group to start with a null value, and in that case I would like to backward fill that null with the first known observation. However, if that complicates the code too much, this point can be skipped.

In this post, a solution in Scala was provided by zero323 for a very similar problem.

But I don't know Scala, and I haven't succeeded in translating it into PySpark API code. Is it possible to do this with PySpark?

Thanks for your help.

Below is a simple sample input:

| cookie_ID | Time       | User_ID |
| --------- | ---------- | ------- |
| 1         | 2015-12-01 | null    |
| 1         | 2015-12-02 | U1      |
| 1         | 2015-12-03 | U1      |
| 1         | 2015-12-04 | null    |
| 1         | 2015-12-05 | null    |
| 1         | 2015-12-06 | U2      |
| 1         | 2015-12-07 | null    |
| 1         | 2015-12-08 | U1      |
| 1         | 2015-12-09 | null    |
| 2         | 2015-12-03 | null    |
| 2         | 2015-12-04 | U3      |
| 2         | 2015-12-05 | null    |
| 2         | 2015-12-06 | U4      |

And the expected output:

| cookie_ID | Time       | User_ID |
| --------- | ---------- | ------- |
| 1         | 2015-12-01 | U1      |
| 1         | 2015-12-02 | U1      |
| 1         | 2015-12-03 | U1      |
| 1         | 2015-12-04 | U1      |
| 1         | 2015-12-05 | U1      |
| 1         | 2015-12-06 | U2      |
| 1         | 2015-12-07 | U2      |
| 1         | 2015-12-08 | U1      |
| 1         | 2015-12-09 | U1      |
| 2         | 2015-12-03 | U3      |
| 2         | 2015-12-04 | U3      |
| 2         | 2015-12-05 | U3      |
| 2         | 2015-12-06 | U4      |

Answer

Another way to get this working is to try something like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One window per cookie, ordered by time, spanning every row from the
# start of the partition up to and including the current row.
window = (
    Window
    .partitionBy('cookie_ID')
    .orderBy('Time')
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

# `joined` is the input DataFrame. F.last with ignorenulls=True picks
# the most recent non-null User_ID within the window, i.e. forward fill.
final = (
    joined
    .withColumn('UserIDFilled', F.last('User_ID', ignorenulls=True).over(window))
)
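
For reference, here is a minimal runnable sketch of the same approach applied to the sample data above. It assumes Spark 2.x or later (ignorenulls and Window.unboundedPreceding are not available in 1.5.1), and df stands in for the answerer's joined DataFrame:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Recreate the sample input from the question.
df = spark.createDataFrame(
    [(1, '2015-12-01', None), (1, '2015-12-02', 'U1'), (1, '2015-12-03', 'U1'),
     (1, '2015-12-04', None), (1, '2015-12-05', None), (1, '2015-12-06', 'U2'),
     (1, '2015-12-07', None), (1, '2015-12-08', 'U1'), (1, '2015-12-09', None),
     (2, '2015-12-03', None), (2, '2015-12-04', 'U3'), (2, '2015-12-05', None),
     (2, '2015-12-06', 'U4')],
    ['cookie_ID', 'Time', 'User_ID'],
)

window = (
    Window
    .partitionBy('cookie_ID')
    .orderBy('Time')
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

filled = df.withColumn('UserIDFilled',
                       F.last('User_ID', ignorenulls=True).over(window))
filled.orderBy('cookie_ID', 'Time').show()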

What this does is construct the window based on the partition key and the order column. It also tells the window to look back over all rows within the partition up to and including the current row. Finally, at each row, it returns the last value that is not null (which, per the window definition, includes the current row).
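
The question also asks about groups that start with a null (the 2015-12-01 row for cookie 1). That case is not covered by the answer above; one possible sketch, continuing from the filled DataFrame in the previous snippet, is to add a backward fill built from a forward-looking window and coalesce the two:

# Forward-looking window: from the current row to the end of the partition.
bwd_window = (
    Window
    .partitionBy('cookie_ID')
    .orderBy('Time')
    .rowsBetween(Window.currentRow, Window.unboundedFollowing)
)

# Keep the forward-filled value where it exists; otherwise fall back to
# the first non-null User_ID found at or after the current row.
final = filled.withColumn(
    'UserIDFilled',
    F.coalesce(
        F.col('UserIDFilled'),
        F.first('User_ID', ignorenulls=True).over(bwd_window),
    ),
)

With this, the leading null for cookie 1 on 2015-12-01 becomes U1, matching the expected output in the question.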
