Filling missing values of categorical values based on other categorical values in pandas dataframe
Problem description
I want to fill the missing values of a categorical column in a pandas DataFrame with the most frequent value within each group of another categorical column. For example,
import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
        'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
        'price': [25, 94, 57, 62, 70, 50, 60]}
df = pd.DataFrame(data)
df
This results in
price | product | type
0 25 | coca | softdrink
1 94 | NaN | juice
2 57 | pepsi | softdrink
3 62 | pepsi | softdrink
4 70 | orange | juice
5 50 | grape | juice
6 60 | NaN | softdrink
First, I use groupby as
df.groupby('type')['product'].value_counts()
to get
type | product
juice | grape | 1
| orange | 1
softdrink | pepsi | 2
| coca | 1
Name: product, dtype: int64
I want to fill the missing product of the "softdrink" row with "pepsi" (the most frequent product for that type) and the missing product of the "juice" row with "grape". Without the categorical grouping, my solution is to find the most frequent value in the column and assign it to the missing values.
df['product'] = df['product'].fillna(df['product'].value_counts().index[0])
I struggle to complete the task because the return value of the command
df.groupby('type')['product'].value_counts()
is a pandas Series, whose entries can be accessed with
df.groupby('type')['product'].value_counts()['softdrink']['pepsi']
How can I find out which product is the most frequent for each category?
Recommended answer
IIUC, use mode.
Data input
import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
        'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
        'price': [25, 94, 57, 62, 70, 50, 60]}
df = pd.DataFrame(data)
Solution
df.groupby('type')['product'].transform(lambda x: x.fillna(x.mode()[0]))
Out[28]:
0 coca
1 grape
2 pepsi
3 pepsi
4 orange
5 grape
6 pepsi
Name: product, dtype: object
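To see which product is the most frequent within each type (the part the question asks about directly), the per-group mode can also be computed as a small lookup table and then mapped back onto the rows. This is only a sketch of an alternative route to the same result; the names most_frequent and filled are made up for illustration. Note that for "juice" there is a tie between "grape" and "orange", and mode() returns its values sorted, so "grape" is picked.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# One row per type: the most frequent product in that group.
# mode() ignores NaN and returns ties in sorted order, so we take the first.
most_frequent = df.groupby('type')['product'].agg(lambda s: s.mode().iloc[0])

# Translate each row's type into its group's most frequent product,
# and use that value only where product is missing.
filled = df['product'].fillna(df['type'].map(most_frequent))

print(most_frequent)
print(filled)
```

The lookup table makes the chosen fill value for each type visible before anything is overwritten, which can help when checking ties.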
New df
df['product'] = df.groupby('type')['product'].transform(lambda x: x.fillna(x.mode()[0]))
df
Out[40]:
price product type
0 25 coca softdrink
1 94 grape juice
2 57 pepsi softdrink
3 62 pepsi softdrink
4 70 orange juice
5 50 grape juice
6 60 pepsi softdrink
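One caveat: if some type has no known product at all, x.mode() is empty for that group and x.mode()[0] fails because there is no first element to pick. A minimal sketch of a more defensive variant follows; the helper fill_with_group_mode and the "tea" rows are hypothetical and not part of the original data.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(s):
    """Fill NaN in s with its most frequent value; return s unchanged if it has none."""
    modes = s.mode()  # NaN is ignored, so an all-NaN group gives an empty result
    if modes.empty:
        return s
    return s.fillna(modes.iloc[0])

# Hypothetical data: every product for 'tea' is missing.
df = pd.DataFrame({
    'type': ['softdrink', 'softdrink', 'softdrink', 'tea', 'tea'],
    'product': ['pepsi', 'pepsi', np.nan, np.nan, np.nan],
})

df['product'] = df.groupby('type')['product'].transform(fill_with_group_mode)
print(df)
```

With this variant the "softdrink" gap is filled with "pepsi", while the "tea" rows stay NaN and can be handled separately, for example with a global fallback value.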