Impute categorical missing values in scikit-learn
Question
I have a pandas DataFrame in which some columns contain text, and those columns include some NaN values. I am trying to impute the NaNs with sklearn.preprocessing.Imputer, replacing each NaN with the most frequent value in its column. The problem is in the implementation.
Suppose df is a pandas DataFrame with 30 columns, 10 of which are categorical.
When I run:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)
Python raises an error: "could not convert string to float: 'run1'", where 'run1' is an ordinary (non-missing) value from the first categorical column.
Any help would be very welcome.
Recommended answer
To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.
import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin


class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)


data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)
This prints:
before...
     0    1    2
0    a    1    2
1    b    1    1
2    b    2    2
3  NaN  NaN  NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
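The suggestion above to treat integer columns differently could be sketched roughly as follows. This is only an illustration under some assumptions: the helper name choose_fill_values is hypothetical, and "integer column" is taken to mean a numeric column whose observed values are all whole numbers. A dtype check would not be enough here, because pandas stores an integer column containing NaN as float64. Columns that are entirely missing are not handled.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def choose_fill_values(df):
    """Return a dict mapping each column label to its fill value.

    Hypothetical helper, not part of pandas or scikit-learn.
    """
    fill = {}
    for col in df.columns:
        observed = df[col].dropna()
        if not is_numeric_dtype(df[col]):
            # Non-numeric column: most frequent observed value.
            fill[col] = observed.value_counts().index[0]
        elif (observed % 1 == 0).all():
            # Only whole numbers observed: treat as integer-like, use the median.
            fill[col] = observed.median()
        else:
            # Fractional values present: use the mean.
            fill[col] = observed.mean()
    return fill


X = pd.DataFrame([
    ['a', 1, 2.5],
    ['b', 1, 1.0],
    ['b', 2, 2.0],
    [np.nan, np.nan, np.nan],
])

# fillna accepts a dict keyed by column label.
filled = X.fillna(choose_fill_values(X))
print(filled)
```

With the median, the fill value for a whole-number column is usually itself a whole number (or a half-integer for an even count), whereas the mean would typically introduce a fractional value such as 1.333333.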