How to fill null values in a dataset using Python based on two other columns?
Problem description
I have a Titanic dataset. I am working mainly with three of its attributes: 1. Age 2. Embarked (the port where the passenger boarded; there are 3 ports in total: S, Q and C) 3. Survived (0 for did not survive, 1 for survived)
I was filtering out the useless data and then needed to fill the null values in Age. So I counted how many passengers survived and did not survive for each port of embarkation, i.e. S, Q and C.
I worked out the mean age of the passengers who survived and of those who did not survive for each of the ports S, Q and C. But now I have no idea how to fill these 6 values (3 for survivors from S, Q and C, and 3 for non-survivors from S, Q and C) into the original titanic Age column. If I simply do titanic.Age.fillna('one of the six values'), it fills all the null values of Age with that one value, which is not what I want.
After spending some time on it, I tried this:
titanic[titanic.Survived==1][titanic.Embarked=='S'].Age.fillna(SurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='Q'].Age.fillna(SurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==1][titanic.Embarked=='C'].Age.fillna(SurvivedC.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='S'].Age.fillna(DidntSurvivedS.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='Q'].Age.fillna(DidntSurvivedQ.Age.mean(),inplace=True)
titanic[titanic.Survived==0][titanic.Embarked=='C'].Age.fillna(DidntSurvivedC.Age.mean(),inplace=True)
This raised no error, but it still does not work. Any idea what I should do?
Recommended answer
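I think you need groupby on both columns, with transform('mean') to get each group's mean age and fillna to write it into the missing ages only:
# fill each missing age with the mean age of its (survived, embarked) group
titanic['age'] = titanic['age'].fillna(
    titanic.groupby(['survived','embarked'])['age'].transform('mean'))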
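To see what the group-wise fill does without downloading anything, here is a minimal sketch on a hand-made DataFrame. The six rows and the `age_filled` column name are made up for illustration; only the column names `survived`, `embarked` and `age` come from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above.
df = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Mean age of each (survived, embarked) group, broadcast back to every row.
group_mean = df.groupby(["survived", "embarked"])["age"].transform("mean")

# Only the missing ages are replaced; known ages are left untouched.
df["age_filled"] = df["age"].fillna(group_mean)
print(df)
```

The missing age in the (0, S) group becomes 30.0, the mean of 20 and 40, and the missing age in the (1, C) group becomes 20.0, the mean of 10 and 30.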
import seaborn as sns
titanic = sns.load_dataset('titanic')
#check NaN rows in age
print (titanic[titanic['age'].isnull()].head(10))
survived pclass sex age sibsp parch fare embarked class \
5 0 3 male NaN 0 0 8.4583 Q Third
17 1 2 male NaN 0 0 13.0000 S Second
19 1 3 female NaN 0 0 7.2250 C Third
26 0 3 male NaN 0 0 7.2250 C Third
28 1 3 female NaN 0 0 7.8792 Q Third
29 0 3 male NaN 0 0 7.8958 S Third
31 1 1 female NaN 1 0 146.5208 C First
32 1 3 female NaN 0 0 7.7500 Q Third
36 1 3 male NaN 0 0 7.2292 C Third
42 0 3 male NaN 0 0 7.8958 C Third
who adult_male deck embark_town alive alone
5 man True NaN Queenstown no True
17 man True NaN Southampton yes True
19 woman False NaN Cherbourg yes True
26 man True NaN Cherbourg no True
28 woman False NaN Queenstown yes True
29 man True NaN Southampton no True
31 woman False B Cherbourg yes False
32 woman False NaN Queenstown yes True
36 man True NaN Cherbourg yes True
42 man True NaN Cherbourg no True
idx = titanic[titanic['age'].isnull()].index
# fill each missing age with the mean age of its (survived, embarked) group
titanic['age'] = titanic['age'].fillna(
    titanic.groupby(['survived','embarked'])['age'].transform('mean'))
# check that the values were replaced
print (titanic.loc[idx].head(10))
survived pclass sex age sibsp parch fare embarked \
5 0 3 male 30.325000 0 0 8.4583 Q
17 1 2 male 28.113184 0 0 13.0000 S
19 1 3 female 28.973671 0 0 7.2250 C
26 0 3 male 33.666667 0 0 7.2250 C
28 1 3 female 22.500000 0 0 7.8792 Q
29 0 3 male 30.203966 0 0 7.8958 S
31 1 1 female 28.973671 1 0 146.5208 C
32 1 3 female 22.500000 0 0 7.7500 Q
36 1 3 male 28.973671 0 0 7.2292 C
42 0 3 male 33.666667 0 0 7.8958 C
class who adult_male deck embark_town alive alone
5 Third man True NaN Queenstown no True
17 Second man True NaN Southampton yes True
19 Third woman False NaN Cherbourg yes True
26 Third man True NaN Cherbourg no True
28 Third woman False NaN Queenstown yes True
29 Third man True NaN Southampton no True
31 First woman False B Cherbourg yes False
32 Third woman False NaN Queenstown yes True
36 Third man True NaN Cherbourg yes True
42 Third man True NaN Cherbourg no True
#check mean values
print (titanic.groupby(['survived','embarked'])['age'].mean())
survived embarked
0 C 33.666667
Q 30.325000
S 30.203966
1 C 28.973671
Q 22.500000
S 28.113184
Name: age, dtype: float64
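As for why the attempt in the question ran without an error yet changed nothing: titanic[mask1][mask2] builds a temporary copy, so fillna(..., inplace=True) modifies that copy and the original DataFrame never sees the result. If you prefer to fill one group at a time, a rough sketch with a single combined mask and .loc could look like this. The five rows are made up for illustration, and the capitalised column names follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names follow the question's capitalised style.
titanic = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0],
    "Embarked": ["S", "S", "S", "S", "S"],
    "Age": [22.0, np.nan, 38.0, 50.0, np.nan],
})

# One combined boolean mask instead of two chained [] selections.
mask = (titanic["Survived"] == 1) & (titanic["Embarked"] == "S")

# Assign through .loc so the write reaches the original DataFrame.
titanic.loc[mask & titanic["Age"].isna(), "Age"] = titanic.loc[mask, "Age"].mean()
print(titanic)
```

This fills only the survivors who embarked at S. It would have to be repeated for the other five groups, which is why the groupby version is shorter.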