How to retain column headers of data frame after Pre-processing in scikit-learn


Question

I have a pandas DataFrame with some rows and columns, and each column has a header. As long as I stay within pandas for data manipulation, the headers are retained. But as soon as I use one of scikit-learn's data preprocessing features, I lose all the headers and the frame is converted to a plain matrix of numbers.

I understand why this happens: scikit-learn returns a NumPy ndarray, and an ndarray is just a matrix with no column names.

But here is the thing. When I build a model on my dataset, even after the initial preprocessing and a first model, I may need further data manipulation before trying another model for a better fit. Without the column headers that is difficult, because I may not know the positional index of a particular variable, whereas it is easy to remember a variable name or look it up with df.columns.

How can I overcome this?

Edit: adding a snapshot of sample data.

    Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
 0       3    0   22      1      0   7.2500         1
 1       1    1   38      1      0  71.2833         2
 2       3    1   26      0      0   7.9250         1
 3       1    1   35      1      0  53.1000         1
 4       3    0   35      0      0   8.0500         1
 5       3    0  NaN      0      0   8.4583         3
 6       1    0   54      0      0  51.8625         1
 7       3    0    2      3      1  21.0750         1
 8       3    1   27      0      2  11.1333         1
 9       2    1   14      1      0  30.0708         2
10       3    1    4      1      1  16.7000         1
11       1    1   58      0      0  26.5500         1
12       3    0   20      0      0   8.0500         1
13       3    0   39      1      5  31.2750         1
14       3    1   14      0      0   7.8542         1
15       2    1   55      0      0  16.0000         1

The above is the pandas DataFrame. When I run the following on it, the column headers are stripped.

from sklearn import preprocessing

# Note: Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaces it.
X_imputed = preprocessing.Imputer().fit_transform(X_train)
X_imputed

The result is a NumPy array, so the column names are gone.

array([[  3.        ,   0.        ,  22.        , ...,   0.        ,
          7.25      ,   1.        ],
       [  1.        ,   1.        ,  38.        , ...,   0.        ,
         71.2833    ,   2.        ],
       [  3.        ,   1.        ,  26.        , ...,   0.        ,
          7.925     ,   1.        ],
       ..., 
       [  3.        ,   1.        ,  29.69911765, ...,   2.        ,
         23.45      ,   1.        ],
       [  1.        ,   0.        ,  26.        , ...,   0.        ,
         30.        ,   2.        ],
       [  3.        ,   0.        ,  32.        , ...,   0.        ,
          7.75      ,   3.        ]])
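
On a current scikit-learn, where `preprocessing.Imputer` is no longer available, the same behaviour can be reproduced with `sklearn.impute.SimpleImputer`. The sketch below is only an illustration, not the original code: the small `X_train` frame is a made-up stand-in built from a few values in the sample above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train: a few values taken from the sample data.
X_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 3],
    "Sex": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 8.4583],
})

# Mean imputation (the default strategy) returns a plain NumPy array by default.
X_imputed = SimpleImputer().fit_transform(X_train)

print(type(X_imputed).__name__)       # the array type, not DataFrame
print(hasattr(X_imputed, "columns"))  # the array carries no column labels
print(list(X_train.columns))          # the names still live on the original frame
```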

So I want to keep the column names when I apply this kind of preprocessing to my pandas DataFrame.

Recommended answer

scikit-learn does strip the column headers in most cases, so just add them back afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original DataFrame, you can restore the column headers with:

import pandas as pd
X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns)
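
Putting the answer together end to end, the sketch below imputes a small frame and rebuilds a DataFrame with the original labels. It assumes a recent scikit-learn, so it uses `SimpleImputer` in place of `preprocessing.Imputer`. The sample frame is invented for illustration, and passing `index=X_train.index` is an addition to the answer that keeps the row labels aligned as well. The approach only holds when the transformer keeps the same columns in the same order.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train, with a non-default index to show it survives.
X_train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 3],
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Fare": [7.25, 71.2833, 7.925, 8.4583],
    },
    index=[10, 11, 12, 13],
)

imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_train)  # NumPy array, labels lost

# Re-attach the column names (and row index) from the original frame.
X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns, index=X_train.index)

# Columns can be addressed by name again.
print(X_imputed_df["Age"])
```

On scikit-learn 1.2 or later, transformers also offer `set_output(transform="pandas")`, which makes `fit_transform` return a DataFrame directly; check the installed version before relying on it.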
