How to retain column headers of a data frame after preprocessing in scikit-learn


Question

I have a pandas data frame with some rows and columns, and each column has a header. As long as I keep doing data manipulation in pandas, my column headers are retained. But if I use one of scikit-learn's data preprocessing features, I lose all my headers and the frame is converted to a plain matrix of numbers.

I understand why this happens: scikit-learn returns a NumPy ndarray as output, and an ndarray, being just a matrix, has no column names.

But here is the thing. When I build a model on my dataset, even after the initial preprocessing and a first attempt at a model, I may need to do more data manipulation to run other models for a better fit. Without access to the column headers this is difficult, because I may not know the index of a particular variable, whereas it is easier to remember a variable name or look it up with df.columns.

How can I overcome this?

Edit: adding a sample data snapshot.

    Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
0        3    0   22      1      0   7.2500         1
1        1    1   38      1      0  71.2833         2
2        3    1   26      0      0   7.9250         1
3        1    1   35      1      0  53.1000         1
4        3    0   35      0      0   8.0500         1
5        3    0  NaN      0      0   8.4583         3
6        1    0   54      0      0  51.8625         1
7        3    0    2      3      1  21.0750         1
8        3    1   27      0      2  11.1333         1
9        2    1   14      1      0  30.0708         2
10       3    1    4      1      1  16.7000         1
11       1    1   58      0      0  26.5500         1
12       3    0   20      0      0   8.0500         1
13       3    0   39      1      5  31.2750         1
14       3    1   14      0      0   7.8542         1
15       2    1   55      0      0  16.0000         1

The above is basically the pandas data frame. When I run the following on it, the column headers are stripped.

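# Note: Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# on current versions its replacement is sklearn.impute.SimpleImputer.
from sklearn import preprocessing
X_imputed = preprocessing.Imputer().fit_transform(X_train)
X_imputed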

The new data is a NumPy array, so the column names are gone.

array([[  3.        ,   0.        ,  22.        , ...,   0.        ,
          7.25      ,   1.        ],
       [  1.        ,   1.        ,  38.        , ...,   0.        ,
         71.2833    ,   2.        ],
       [  3.        ,   1.        ,  26.        , ...,   0.        ,
          7.925     ,   1.        ],
       ..., 
       [  3.        ,   1.        ,  29.69911765, ...,   2.        ,
         23.45      ,   1.        ],
       [  1.        ,   0.        ,  26.        , ...,   0.        ,
         30.        ,   2.        ],
       [  3.        ,   0.        ,  32.        , ...,   0.        ,
          7.75      ,   3.        ]])

So I want to retain the column names when I do data manipulation on my pandas data frame.

Recommended answer

scikit-learn does indeed strip the column headers in most cases, so just add them back afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original data frame, you can put the column headers back with:

import pandas as pd
X_imputed_df = pd.DataFrame(X_imputed, columns=X_train.columns)
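
For anyone who wants to try this end to end, here is a minimal sketch of the same idea. It rests on a few assumptions that are not in the original post: the small X_train below is a hypothetical stand-in built from a few rows of the question's snapshot, SimpleImputer from sklearn.impute is used in place of the older preprocessing.Imputer, and the original row index is passed along as well so that rows still line up with the source frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for X_train: a few rows taken from the question's snapshot.
X_train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 3],
        "Sex": [0, 1, 1, 0],
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Fare": [7.25, 71.2833, 7.925, 8.4583],
    },
    index=[0, 1, 2, 5],
)

# Mean imputation; by default fit_transform returns a plain NumPy array.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_train)

# Re-wrap the array, reusing the original column labels and row index.
X_imputed_df = pd.DataFrame(
    X_imputed, columns=X_train.columns, index=X_train.index
)

print(X_imputed_df.columns.tolist())
print(X_imputed_df)
```

This only works as long as the transformer keeps the number and order of columns unchanged. SimpleImputer, for instance, drops a column that is entirely missing by default, in which case the labels would no longer match. On scikit-learn 1.2 and later, transformers also offer set_output(transform="pandas"), which asks them to return a data frame directly.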
