Groupby fill missing values in dataframe based on average of previous and next available values
Question
I have a data frame with some groups, and I want to fill the missing values in the score column with the average of the last previous available value and the next available value, i.e. (previous value + next value) / 2.
I want to group by state, school, class, subject and then fill the values.
If the first value in the score column is not available, fill it with the next available value; if the last value is not available, fill it with the previously available value. This needs to be followed for each group.
It is a complex data-imputation problem. I searched online and found that pandas has some related functionality, i.e. pandas.core.groupby.DataFrameGroupBy.ffill, but I don't know how to use it in this case.
I am open to solving this in Python, PySpark, or SQL!
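Since the question mentions pandas' GroupBy.ffill, here is a minimal sketch of how that idea could implement the (previous + next) / 2 rule. It assumes a simplified frame with a single hypothetical grouping column, "class"; the real data would be grouped by state, school, class, subject instead:

```python
import pandas as pd

# Hypothetical sample frame; "class" stands in for the real
# (state, school, class, subject) grouping.
df = pd.DataFrame({
    "class": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "score": [None, 46, None, None, 35, 78, None, None, 56],
})

g = df.groupby("class")["score"]
prev = g.ffill()  # last previous available value (NaN at the start of a group)
nxt = g.bfill()   # next available value (NaN at the end of a group)

# Average both sides; where one side is missing (group edge),
# fall back to the other side, matching the edge rules above.
df["new_score"] = (prev.fillna(nxt) + nxt.fillna(prev)) / 2
print(df["new_score"].tolist())
# → [46.0, 46.0, 40.5, 40.5, 35.0, 78.0, 67.0, 67.0, 56.0]
```

The ffill/bfill pair plays the same role as the two window functions in the Spark answer below: one scan carries the last known value forward, the other carries the next known value backward, and the fillna calls act as the coalesce for the group edges.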
My data frame looks like this:
Perhaps this is helpful -
Load the test data
df2.show(false)
df2.printSchema()
/**
* +-----+-----+
* |class|score|
* +-----+-----+
* |A |null |
* |A |46 |
* |A |null |
* |A |null |
* |A |35 |
* |A |null |
* |A |null |
* |A |null |
* |A |46 |
* |A |null |
* |A |null |
* |B |78 |
* |B |null |
* |B |null |
* |B |null |
* |B |null |
* |B |null |
* |B |56 |
* |B |null |
* +-----+-----+
*
* root
* |-- class: string (nullable = true)
* |-- score: integer (nullable = true)
*/
Impute null values in the score column (check the new_score column)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, first, last}
import spark.implicits._ // for the $"col" syntax

val w1 = Window.partitionBy("class").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = Window.partitionBy("class").rowsBetween(Window.currentRow, Window.unboundedFollowing)
df2.withColumn("previous", last("score", ignoreNulls = true).over(w1))
.withColumn("next", first("score", ignoreNulls = true).over(w2))
.withColumn("new_score", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
.drop("next", "previous")
.show(false)
/**
* +-----+-----+---------+
* |class|score|new_score|
* +-----+-----+---------+
* |A |null |46.0 |
* |A |46 |46.0 |
* |A |null |40.5 |
* |A |null |40.5 |
* |A |35 |35.0 |
* |A |null |40.5 |
* |A |null |40.5 |
* |A |null |40.5 |
* |A |46 |46.0 |
* |A |null |46.0 |
* |A |null |46.0 |
* |B |78 |78.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |null |67.0 |
* |B |56 |56.0 |
* |B |null |56.0 |
* +-----+-----+---------+
*/