Pyspark - how to backfill a DataFrame?


Question

How can you do the same thing as df.fillna(method='bfill') for a pandas dataframe with a pyspark.sql.DataFrame?

The pyspark DataFrame has a pyspark.sql.DataFrame.fillna method, but it does not support a method parameter.

In pandas you can use the following to backfill a time series:

Create the data

import pandas as pd

index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]

df = pd.DataFrame({'data': data}, index=index)

This gives

Out[1]:
            data
2017-01-01  1.0
2017-01-02  2.0
2017-01-03  3.0
2017-01-04  NaN
2017-01-05  5.0

Backfill the DataFrame

df = df.fillna(method='bfill')

which produces the backfilled frame

Out[2]:
            data
2017-01-01  1.0
2017-01-02  2.0
2017-01-03  3.0
2017-01-04  5.0
2017-01-05  5.0
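As an aside, recent pandas releases deprecate the method= argument to fillna in favor of the dedicated DataFrame.bfill method. A minimal sketch of the same backfill in the newer spelling (same sample data as above):

```python
import pandas as pd

index = pd.date_range('2017-01-01', '2017-01-05')
df = pd.DataFrame({'data': [1, 2, 3, None, 5]}, index=index)

# bfill() is the modern equivalent of fillna(method='bfill'):
# each NaN takes the next non-null value below it.
filled = df.bfill()
```

Here `filled` holds 5.0 on both 2017-01-04 and 2017-01-05, matching the output shown above.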

How can the same thing be done for a pyspark.sql.DataFrame?

Answer

The last and first functions, with their ignorenulls=True flag, can be combined with a rowsBetween window. To fill backwards, select the first non-null value between the current row and the end; to fill forwards, select the last non-null value between the beginning and the current row.

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

# Backfill: for each row, take the first non-null 'data' value found
# between the current row and the end of the window (ordered by 'date').
df.withColumn(
    'data',
    F.first('data', ignorenulls=True).over(
        W.orderBy('date').rowsBetween(0, sys.maxsize)
    )
)

Source for the Spark filling approach: https://towardsdatascience.com/end-to-end-time-series-interpolation-in-pyspark-filling-the-gap-5ccefc6b7fc9

