Filling missing values of a categorical column based on another categorical column in a pandas DataFrame


Problem description

I want to fill the missing values of a categorical column in a pandas DataFrame with the most frequent value within each group of another categorical column. For example,

import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange','grape',np.nan], 
    'price': [25, 94, 57, 62, 70,50,60]}
df = pd.DataFrame(data)
df

This results in

    price | product | type
0      25 | coca    | softdrink
1      94 | NaN     | juice
2      57 | pepsi   | softdrink
3      62 | pepsi   | softdrink
4      70 | orange  | juice
5      50 | grape   | juice
6      60 | NaN     | softdrink

First, I use groupby:

df.groupby('type')['product'].value_counts()   

which gives

type      | product |
juice     | grape   | 1
          | orange  | 1
softdrink | pepsi   | 2
          | coca    | 1
Name: product, dtype: int64

I want to fill the missing product in row 6 (type "softdrink") with "pepsi", the most frequent product of that type, and the missing product in row 1 (type "juice") with "grape". Without grouping by type, my solution is to find the most frequent value in the column and assign it to the missing values:

# Assign the result back instead of using inplace=True on a column selection
df['product'] = df['product'].fillna(df['product'].value_counts().index[0])

I am struggling to complete the task because the return value of the command

df.groupby('type')['product'].value_counts()

is a pandas Series, whose entries can be accessed with

df.groupby('type')['product'].value_counts()['softdrink']['pepsi']

How can I find out which product is the most frequent within each type?
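For the value_counts route described above, here is a minimal sketch of one way to read off the top product per type. It assumes the sample data from the table (row 6 is a softdrink) and relies on the grouped value_counts result being sorted by count within each type, so the first row of each group is the most frequent product. The names counts, top and fill_values are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# Counts per (type, product); NaN is ignored and counts are sorted within each type
counts = df.groupby('type')['product'].value_counts()

# Keep only the first (most frequent) row of each type
top = counts.groupby(level='type').head(1)

# The index holds (type, product) pairs, so it converts directly to a lookup dict
fill_values = dict(top.index)
print(fill_values)
```

When two products tie within a type (grape and orange here), which one comes first depends on how value_counts orders the tie, so the juice entry should not be relied on.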

Recommended answer

IIUC, use mode.

Data input

import pandas as pd
import numpy as np
data = {'type': ['softdrink', 'juice', 'softdrink', 'softdrink',    'juice','juice','softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange','grape',np.nan],
    'price': [25, 94, 57, 62, 70,50,60]}
df = pd.DataFrame(data)


Solution

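# Select the column with ['product']; .product can be confused with the DataFrame.product method
df.groupby('type')['product'].transform(lambda x: x.fillna(x.mode()[0]))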

Out[28]: 
0      coca
1     grape
2     pepsi
3     pepsi
4    orange
5     grape
6     pepsi
Name: product, dtype: object
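To make the per-type fill values visible, the same idea can be split into two steps. This is only a sketch of an equivalent approach, not part of the original answer; most_frequent and filled are illustrative names. Series.mode() ignores NaN and returns its values sorted, which is why the grape/orange tie in "juice" resolves to "grape".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60],
})

# Step 1: most frequent product per type (first of the sorted modes on ties)
most_frequent = df.groupby('type')['product'].agg(lambda x: x.mode()[0])

# Step 2: look up each row's type and fill only the missing products
filled = df['product'].fillna(df['type'].map(most_frequent))
print(most_frequent)
print(filled)
```

Keeping most_frequent as a separate Series makes it easy to inspect or override the chosen value for a type before filling.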


New df

df['product'] = df.groupby('type')['product'].transform(lambda x: x.fillna(x.mode()[0]))
df
Out[40]: 
   price product       type
0     25    coca  softdrink
1     94   grape      juice
2     57   pepsi  softdrink
3     62   pepsi  softdrink
4     70  orange      juice
5     50   grape      juice
6     60   pepsi  softdrink
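One caveat: x.mode()[0] raises an error when a type has no known product at all, because mode() then returns an empty Series. The sketch below guards against that case. The extra "water" row and the helper name fill_with_group_mode are hypothetical additions for illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    # The last row ('water') is a hypothetical type whose only product is missing
    'type': ['softdrink', 'juice', 'softdrink', 'softdrink', 'juice', 'juice', 'softdrink', 'water'],
    'product': ['coca', np.nan, 'pepsi', 'pepsi', 'orange', 'grape', np.nan, np.nan],
    'price': [25, 94, 57, 62, 70, 50, 60, 10],
})

def fill_with_group_mode(group):
    """Fill NaN with the group's most frequent value; leave the group as-is if it has none."""
    modes = group.mode()
    if modes.empty:
        return group
    return group.fillna(modes.iloc[0])

df['product'] = df.groupby('type')['product'].transform(fill_with_group_mode)
print(df)
```

Rows of a type with no observed product stay missing, so they can be handled separately, for example with a global fallback value.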
