Fill Pyspark dataframe column null values with average value from same column
Question

For a dataframe like this,
rdd_2 = sc.parallelize([
    (0, 10, 223, "201601"), (0, 10, 83, "2016032"),
    (1, 20, None, "201602"), (1, 20, 3003, "201601"), (1, 20, None, "201603"),
    (2, 40, 2321, "201601"), (2, 30, 10, "201602"), (2, 61, None, "201601")])
df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|null| 201602|
| 1| 20|3003| 201601|
| 1| 20|null| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|null| 201601|
+---+----+----+-------+
I need to fill the null values with the average of the existing values, with the expected result being
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
where 1128 is the average of the existing values. I need to do that for several columns.
My current approach is to use na.fill:
fill_values = {
    column: df_data.agg({column: "mean"}).flatMap(list).collect()[0]
    for column in df_data.columns if column not in ['date', 'id']
}
df_data = df_data.na.fill(fill_values)
+---+----+----+-------+
| id|type|cost| date|
+---+----+----+-------+
| 0| 10| 223| 201601|
| 0| 10| 83|2016032|
| 1| 20|1128| 201602|
| 1| 20|3003| 201601|
| 1| 20|1128| 201603|
| 2| 40|2321| 201601|
| 2| 30| 10| 201602|
| 2| 61|1128| 201601|
+---+----+----+-------+
But this is very cumbersome. Any ideas?
Answer
Well, one way or another you have to:

- compute the statistics
- fill the blanks

That pretty much limits what you can really improve here. Still, you can:

- replace flatMap(list).collect()[0] with first()[0] or structure unpacking
- compute all stats with a single action
- use the built-in Row methods to extract a dictionary
The final result could look like this:
from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    # Compute the mean of every non-excluded column in a single action
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    # stats has exactly one Row; asDict() gives {column_name: mean}
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.