Pyspark - how to backfill a DataFrame?
Question
How can you do the same thing as df.fillna(method='bfill') for a pandas DataFrame with a pyspark.sql.DataFrame?
The pyspark DataFrame has the pyspark.sql.DataFrame.fillna method, however there is no support for a method parameter.
In pandas you can use the following to backfill a time series:
Create the data
import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)
giving
Out[1]:
data
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 NaN
2017-01-05 5.0
Backfill the DataFrame
df = df.fillna(method='bfill')
producing the backfilled frame
Out[2]:
data
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 5.0
2017-01-05 5.0
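As an aside, newer pandas releases deprecate the method= argument to fillna in favour of the dedicated bfill() and ffill() methods; a minimal sketch of the same fills using them:

```python
import pandas as pd

# Same sample series as above: a single gap on 2017-01-04.
index = pd.date_range('2017-01-01', '2017-01-05')
df = pd.DataFrame({'data': [1, 2, 3, None, 5]}, index=index)

# bfill() pulls the next valid value backwards into the gap;
# ffill() pushes the previous valid value forwards.
backfilled = df.bfill()
forwardfilled = df.ffill()
```

Here backfilled fills 2017-01-04 with 5.0 and forwardfilled fills it with 3.0.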
How can the same thing be done for a pyspark.sql.DataFrame?
Answer
The last and first functions, with their ignorenulls=True flag, can be combined with a rowsBetween window. To fill backwards, we take the first non-null value between the current row and the end; to fill forwards, we take the last non-null value between the beginning and the current row.
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W
import sys

df.withColumn(
    'data',
    F.first('data', ignorenulls=True).over(
        W.orderBy('date').rowsBetween(0, sys.maxsize)
    )
)