Pyspark - how to backfill a DataFrame?
Question
How can you do the same thing as df.fillna(method='bfill') for a pandas DataFrame with a pyspark.sql.DataFrame?
The pyspark DataFrame has the pyspark.sql.DataFrame.fillna method, however there is no support for a method parameter.
In pandas you can use the following to backfill a time series:
Create the data
import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)
giving
Out[1]:
data
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 NaN
2017-01-05 5.0
Backfill the DataFrame
df = df.fillna(method='bfill')
producing the backfilled frame
Out[2]:
data
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 5.0
2017-01-05 5.0
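As an aside, newer pandas releases deprecate the method= argument to fillna in favour of the dedicated bfill() and ffill() methods; a minimal sketch of the same fills using them:

```python
import pandas as pd

# Same sample series as above: a single gap on 2017-01-04.
index = pd.date_range('2017-01-01', '2017-01-05')
df = pd.DataFrame({'data': [1, 2, 3, None, 5]}, index=index)

# bfill() pulls the next valid value backwards into the gap;
# ffill() pushes the previous valid value forwards.
backfilled = df.bfill()
forwardfilled = df.ffill()
```

Here backfilled fills 2017-01-04 with 5.0 and forwardfilled fills it with 3.0.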
How can the same thing be done for a pyspark.sql.DataFrame?
Answer
The last and first functions, with their ignorenulls=True flag, can be combined with a rowsBetween window. To fill backwards, we take the first non-null value between the current row and the end; to fill forwards, we take the last non-null value between the beginning and the current row.
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

df.withColumn(
    'data',
    F.first('data', ignorenulls=True).over(
        W.orderBy('date').rowsBetween(0, sys.maxsize)
    )
)