Pyspark-如何回填DataFrame? [英] Pyspark - how to backfill a DataFrame?
本文介绍了Pyspark-如何回填DataFrame?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
How can you do the same thing as df.fillna(method='bfill')
for a pandas dataframe with a pyspark.sql.DataFrame
?
The pyspark dataframe has the pyspark.sql.DataFrame.fillna
method, however there is no support for a method
parameter.
在熊猫中,您可以使用以下方法回补时间序列:
In pandas you can use the following to backfill a time series:
创建数据
import pandas as pd
index = pd.date_range('2017-01-01', '2017-01-05')
data = [1, 2, 3, None, 5]
df = pd.DataFrame({'data': data}, index=index)
给予
Out[1]:
data
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 NaN
2017-01-05 5.0
回填数据框
df = df.fillna(method='bfill')
制作回填帧
Out[2]:
data
2017-01-01 1.0
2017-01-02 2.0
2017-01-03 3.0
2017-01-04 5.0
2017-01-05 5.0
如何为pyspark.sql.DataFrame
做同样的事情?
推荐答案
实际上,对分布式数据集进行回填并不像在熊猫(本地)数据帧中那样容易-您无法确定要填充的值在同一分区中.我会在窗口中使用crossJoin,例如fo DF:
Actually backfill on distributed dataset is not as easy task as in pandas (local) dataframe - you cannot be sure that value to fill exists in the same partition. I would use crossJoin with windowing, for example fo DF:
df = spark.createDataFrame([
('2017-01-01', None),
('2017-01-02', 'B'),
('2017-01-03', None),
('2017-01-04', None),
('2017-01-05', 'E'),
('2017-01-06', None),
('2017-01-07', 'G')], ['date', 'value'])
df.show()
+----------+-----+
| date|value|
+----------+-----+
|2017-01-01| null|
|2017-01-02| B|
|2017-01-03| null|
|2017-01-04| null|
|2017-01-05| E|
|2017-01-06| null|
|2017-01-07| G|
+----------+-----+
代码为:
from pyspark.sql.window import Window
df.alias('a').crossJoin(df.alias('b')) \
.where((col('b.date') >= col('a.date')) & (col('a.value').isNotNull() | col('b.value').isNotNull())) \
.withColumn('rn', row_number().over(Window.partitionBy('a.date').orderBy('b.date'))) \
.where(col('rn') == 1) \
.select('a.date', coalesce('a.value', 'b.value').alias('value')) \
.orderBy('a.date') \
.show()
+----------+-----+
| date|value|
+----------+-----+
|2017-01-01| B|
|2017-01-02| B|
|2017-01-03| E|
|2017-01-04| E|
|2017-01-05| E|
|2017-01-06| G|
|2017-01-07| G|
+----------+-----+
这篇关于Pyspark-如何回填DataFrame?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文