Fill Pyspark dataframe column null values with average value from same column


Problem Description

Given a dataframe like this,

rdd_2 = sc.parallelize([
    (0, 10, 223, "201601"), (0, 10, 83, "2016032"),
    (1, 20, None, "201602"), (1, 20, 3003, "201601"), (1, 20, None, "201603"),
    (2, 40, 2321, "201601"), (2, 30, 10, "201602"), (2, 61, None, "201601")
])

df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
df_data.show()

+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|null| 201602|
|  1|  20|3003| 201601|
|  1|  20|null| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|null| 201601|
+---+----+----+-------+

I need to fill the null values with the average of the existing values, with the expected result being

+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|1128| 201602|
|  1|  20|3003| 201601|
|  1|  20|1128| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|1128| 201601|
+---+----+----+-------+

where 1128 is the average of the existing values. I need to do that for several columns.

My current approach is to use na.fill:

fill_values = {
    column: df_data.agg({column: "mean"}).flatMap(list).collect()[0]
    for column in df_data.columns
    if column not in ['date', 'id']
}
df_data = df_data.na.fill(fill_values)

+---+----+----+-------+
| id|type|cost|   date|
+---+----+----+-------+
|  0|  10| 223| 201601|
|  0|  10|  83|2016032|
|  1|  20|1128| 201602|
|  1|  20|3003| 201601|
|  1|  20|1128| 201603|
|  2|  40|2321| 201601|
|  2|  30|  10| 201602|
|  2|  61|1128| 201601|
+---+----+----+-------+

But this is very cumbersome. Any ideas?

Recommended Answer

Well, one way or another you have to:

  • compute statistics
  • fill the blanks

It pretty much limits what you can really improve here, still:

  • replace flatMap(list).collect()[0] with first()[0] or structure unpacking
  • compute all stats with a single action
  • use built-in Row methods to extract a dictionary

The final result could look like this:

from pyspark.sql.functions import avg

def fill_with_mean(df, exclude=set()):
    # Compute the mean of every non-excluded column in a single action
    stats = df.agg(*(
        avg(c).alias(c) for c in df.columns if c not in exclude
    ))
    # first() returns a Row; asDict() turns it into a {column: mean} mapping
    return df.na.fill(stats.first().asDict())

fill_with_mean(df_data, ["id", "date"])
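
Because all the averages come out of a single agg call, the statistics are computed in one Spark action no matter how many columns are being filled, instead of one action per column as in the original approach.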

In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
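
A minimal sketch of the Imputer approach, assuming Spark 2.2+. Imputer only accepts float/double input columns, so cost is cast to double first; cost_imputed is an output column name chosen here for illustration:

from pyspark.ml.feature import Imputer

# Imputer requires FloatType/DoubleType inputs, so cast "cost" first
df_numeric = df_data.withColumn("cost", df_data["cost"].cast("double"))

imputer = Imputer(
    strategy="mean",              # replace nulls with the column mean
    inputCols=["cost"],
    outputCols=["cost_imputed"],  # hypothetical name for the filled column
)
df_filled = imputer.fit(df_numeric).transform(df_numeric)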

