根据之前可用值和下一个可用值的平均值对数据框中的缺失值进行分组填充 [英] Groupby fill missing values in dataframe based on average of previous values available and next value available

查看:38
本文介绍了根据之前可用值和下一个可用值的平均值对数据框中的缺失值进行分组填充的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有包含某些组的数据框,并且我想根据得分列的最后一个上一个可用值和下一个可用值平均值来填充缺失值,即(上一个值+下一个值)/2.

我想按州,学校,班级,主题分组,然后填写值.

如果第一个值在分数列中不可用,则用下一个或第二个可用值填充该值如果最后一个值不可用,则用先前可用的值填充该值对于每个小组,都需要遵循.

是数据插补复杂的问题.我在网上搜索后发现,熊猫具有某些功能,即pandas.core.groupby.DataFrameGroupBy.ffill,但不知道在这种情况下如何使用.

我正在考虑使用python,pyspark,SQL解决!

我的数据框看起来像这样

解决方案

也许这很有帮助-

加载测试数据

df2.show(false)df2.printSchema()/*** + ----- + ----- +* |课程|得分|* + ----- + ----- +* | A |为空|* | A | 46 |* | A |为空|* | A |为空|* | A | 35 |* | A |为空|* | A |为空|* | A |为空|* | A | 46 |* |A |空 |* | A |为空|* | B | 78 |* | B |为空|* | B |为空|* | B |为空|* | B |为空|* | B |为空|* | B | 56 |* |B |空 |* + ----- + ----- +** 根* |-类:字符串(可为空= true)* |-得分:整数(可为空= true)*/ 

对得分列中的空值进行计数(请检查new_score列)

 val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding,Window.currentRow)val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow,Window.unboundedFollowing)df2.withColumn("previous", last("score", ignoreNulls = true).over(w1)).withColumn("next",first("score",ignoreNulls = true).over(w2))..withColumn("new_score",(coalesce($上一个",$下一个")+ coalcece($下一个",$上一个"))/2).drop(下一个",上一个").show(假)/*** + ----- + ----- + --------- +* | class |得分| new_score |* +-----+-----+-----+* | A |空| 46.0 |* | A | 46 | 46.0 |* | A |空| 40.5 |* | A |空| 40.5 |* | A | 35 | 35.0 |* | A |空| 40.5 |* |A |空 |40.5 |* | A |空| 40.5 |* | A | 46 | 46.0 |* | A |空| 46.0 |* | A |空| 46.0 |* | B | 78 | 78.0 |* | B |空| 67.0 |* | B |空| 67.0 |* |B |null |67.0 |* | B |空| 67.0 |* | B |空| 67.0 |* | B | 56 | 56.0 |* | B |空| 56.0 |* + ----- + ----- + --------- +*/ 

I have data frame which has some groups and I want to fill the missing values based on last previous available and next value available average of score column i.e. (previous value+next value)/2.

I want to group by state,school,class,subject and then fill value.

If the first value not available in score column then fill the value with value which is available next or If the last value not available then fill the value with value which is available previously for each group this needs to be followed.

It is data imputation complex problem. I searched online and found pandas has some functionality i.e. pandas.core.groupby.DataFrameGroupBy.ffill but dont know how to use in this case.

I am thinking to solve in python,pyspark,SQL !

My data frame looks like this

解决方案

Perhaps this is helpful -

Load the test data

df2.show(false)
    df2.printSchema()
    /**
      * +-----+-----+
      * |class|score|
      * +-----+-----+
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |A    |35   |
      * |A    |null |
      * |A    |null |
      * |A    |null |
      * |A    |46   |
      * |A    |null |
      * |A    |null |
      * |B    |78   |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |null |
      * |B    |56   |
      * |B    |null |
      * +-----+-----+
      *
      * root
      * |-- class: string (nullable = true)
      * |-- score: integer (nullable = true)
      */

Impute Null values from score columns(check new_score column)


    val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
      .withColumn("next", first("score", ignoreNulls = true).over(w2))
      .withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
      .drop("next", "previous")
      .show(false)

    /**
      * +-----+-----+---------+
      * |class|score|new_score|
      * +-----+-----+---------+
      * |A    |null |46.0     |
      * |A    |46   |46.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |35   |35.0     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |null |40.5     |
      * |A    |46   |46.0     |
      * |A    |null |46.0     |
      * |A    |null |46.0     |
      * |B    |78   |78.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |null |67.0     |
      * |B    |56   |56.0     |
      * |B    |null |56.0     |
      * +-----+-----+---------+
      */

这篇关于根据之前可用值和下一个可用值的平均值对数据框中的缺失值进行分组填充的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆