Impute categorical missing values in scikit-learn

Views: 134 Published: 2020/5/23 21:31:55 Tags: python pandas scikit-learn imputation

Problem description

I have a pandas DataFrame in which some columns are of text type, and these text columns contain some NaN values. I am trying to impute those NaNs with sklearn.preprocessing.Imputer, replacing each NaN with the most frequent value. The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. When I run:

from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)

Python raises an error: "could not convert string to float: 'run1'", where 'run1' is an ordinary (non-missing) value from the first column with categorical data.

Any help would be much appreciated.

Recommended answer

To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like this. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin


class DataFrameImputer(TransformerMixin):
    """Impute missing values in a pandas DataFrame.

    Columns of dtype object are imputed with the most frequent value
    in the column.

    Columns of other types are imputed with the mean of the column.
    """

    def fit(self, X, y=None):
        # One fill value per column. value_counts() sorts by frequency,
        # so index[0] is the most frequent value of an object column;
        # every other column gets its mean.
        self.fill = pd.Series(
            [X[c].value_counts().index[0]
             if X[c].dtype == np.dtype('O') else X[c].mean()
             for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        # fillna with a Series fills each column with its matching entry.
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]

X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)

print('before...')
print(X)
print('after...')
print(xt)

which prints:

before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
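
The suggestion above to treat integer columns differently could be sketched roughly as follows. This is only an illustration and not part of the original answer: the helper names column_fill_value and impute_frame are made up for this example, and "integer column" is assumed to mean a numeric column whose observed values are all whole numbers, since by default pandas stores an integer column that contains NaN as float64, so the dtype alone cannot tell the two apart.

```python
import numpy as np
import pandas as pd


def column_fill_value(col):
    """Pick a fill value for one column (hypothetical helper).

    Returns None when the column has no observed values to learn from.
    """
    observed = col.dropna()
    if observed.empty:
        return None
    if not pd.api.types.is_numeric_dtype(col):
        # Non-numeric column: most frequent observed value.
        return observed.value_counts().index[0]
    if (observed % 1 == 0).all():
        # Assumption: whole-number values mean an integer column,
        # even if it is stored as float64 because of the NaNs.
        return observed.median()
    # Any other numeric column: mean of the observed values.
    return observed.mean()


def impute_frame(df):
    """Return a copy of df with NaNs filled column by column."""
    fill = {}
    for c in df.columns:
        value = column_fill_value(df[c])
        if value is not None:
            fill[c] = value
    # fillna with a dict fills each listed column with its own value.
    return df.fillna(value=fill)


data = [
    ['a', 1, 2, 0.5],
    ['b', 1, 1, 1.5],
    ['b', 2, 2, 2.5],
    [np.nan, np.nan, np.nan, np.nan],
]

result = impute_frame(pd.DataFrame(data))
print(result)
```

With this data the last row is filled with 'b' (most frequent), 1.0 and 2.0 (medians of the whole-number columns) and 1.5 (mean of the float column). Columns that contain no observed values at all are left untouched.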
