How to retain column headers of data frame after pre-processing in scikit-learn
Question
I have a pandas data frame with some rows and columns, and each column has a header. As long as I keep doing data manipulation in pandas, my column headers are retained. But if I use one of scikit-learn's data pre-processing features, I lose all my headers and the frame is converted to a plain matrix of numbers.
I understand why this happens: scikit-learn returns a numpy ndarray as output, and an ndarray, being just a matrix, has no column names.
But here is the thing. When I build a model on my dataset, even after the initial pre-processing and trying one model, I may need to do more data manipulation to run another model for a better fit. Without access to the column headers this is difficult, because I might not know the index of a particular variable, whereas it is easier to remember a variable name or look it up with df.columns.
How can I overcome this?
Edit: adding a sample data snapshot.
    Pclass  Sex  Age  SibSp  Parch     Fare  Embarked
0        3    0   22      1      0   7.2500         1
1        1    1   38      1      0  71.2833         2
2        3    1   26      0      0   7.9250         1
3        1    1   35      1      0  53.1000         1
4        3    0   35      0      0   8.0500         1
5        3    0  NaN      0      0   8.4583         3
6        1    0   54      0      0  51.8625         1
7        3    0    2      3      1  21.0750         1
8        3    1   27      0      2  11.1333         1
9        2    1   14      1      0  30.0708         2
10       3    1    4      1      1  16.7000         1
11       1    1   58      0      0  26.5500         1
12       3    0   20      0      0   8.0500         1
13       3    0   39      1      5  31.2750         1
14       3    1   14      0      0   7.8542         1
15       2    1   55      0      0  16.0000         1
The above is basically the pandas data frame. When I run the following on it, the column headers are stripped.
from sklearn import preprocessing
X_imputed=preprocessing.Imputer().fit_transform(X_train)
X_imputed
The new data is a numpy array, and hence the column names are stripped.
array([[ 3. , 0. , 22. , ..., 0. ,
7.25 , 1. ],
[ 1. , 1. , 38. , ..., 0. ,
71.2833 , 2. ],
[ 3. , 1. , 26. , ..., 0. ,
7.925 , 1. ],
...,
[ 3. , 1. , 29.69911765, ..., 2. ,
23.45 , 1. ],
[ 1. , 0. , 26. , ..., 0. ,
30. , 2. ],
[ 3. , 0. , 32. , ..., 0. ,
7.75 , 3. ]])
So I want to retain the column names when I do data manipulation on my pandas data frame.
Recommended answer
scikit-learn indeed strips the column headers in most cases, so just add them back on afterward. In your example, with X_imputed as the sklearn.preprocessing output and X_train as the original dataframe, you can put the column headers back on with:
X_imputed_df = pd.DataFrame(X_imputed, columns = X_train.columns)
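For readers on a recent scikit-learn release, here is a minimal end-to-end sketch of the same idea. It assumes sklearn.impute.SimpleImputer as the imputer, since preprocessing.Imputer was removed in later versions, and it uses a small stand-in for X_train built from a few of the sample rows in the question. Passing index=X_train.index is an addition not in the answer above; it keeps the row labels aligned with the original frame as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for X_train: a few rows and columns from the question's sample
# (illustrative only, not the full dataset).
X_train = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 3],
        "Sex": [0, 1, 1, 0],
        "Age": [22.0, 38.0, 26.0, np.nan],
        "Fare": [7.25, 71.2833, 7.925, 8.4583],
    },
    index=[0, 1, 2, 5],
)

# Mean imputation; fit_transform returns a plain numpy array without headers.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X_train)

# Wrap the array back into a DataFrame, reusing the original column names
# and row index so later lookups by name still work.
X_imputed_df = pd.DataFrame(
    X_imputed, columns=X_train.columns, index=X_train.index
)
print(X_imputed_df)
```

This only works when the transformer returns the same number of columns in the same order. SimpleImputer, for example, can drop a column that is entirely missing, in which case the column list has to be adjusted first. Newer scikit-learn versions (1.2 and later) also let a transformer return a DataFrame directly via set_output(transform="pandas").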