How to keep track of columns after encoding categorical variables?
Question
How can I keep track of the original columns of a dataset once I perform data preprocessing on it?
In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth.
However, once I encode the categorical column B, df_columns is no longer valid for keeping track of the columns in df_dummies.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

animal = ['dog', 'cat', 'horse']
df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for i in range(9)],
                   'C': np.random.rand(9),
                   'D': np.random.rand(9)})
df_array = df.values
df_columns = df.columns

# One-hot encode column 1 (B) and pass the remaining columns through unchanged
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
# np.float was removed in NumPy 1.24; the builtin float is equivalent
df_dummies = np.array(ct.fit_transform(df_array), dtype=float)
The solution should be agnostic of the position of the categorical column, be it A, B, C or D. I could do the grunt work and keep updating df_columns by hand, but that would not be elegant or "pythonic".
Furthermore, how would the solution keep track of what the categorical values mean? For example, {0,0,1} would be cat, {0,1,0} would be dog, and so on.
PS: I am aware of the dummy variable trap and will take df_dummies[:,1:] when I actually use it to train my model.
Recommended answer
Can you confirm whether future data sets will continue to have the same column names? If I understood your question correctly, all you need to do is save df_columns from the original data frame and use it to reindex your new data frame.
new_df_reindexed = new_df[df_columns]
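As a minimal sketch of that reindexing, assuming a new frame with the same column names in a different order (the data and the contents of new_df are made up for illustration):

```python
import pandas as pd

# Hypothetical training frame; only its column order matters here.
df = pd.DataFrame({"A": [0.1, 0.2], "B": ["dog", "cat"],
                   "C": [0.3, 0.4], "D": [0.5, 0.6]})
df_columns = list(df.columns)  # saved once at training time

# Hypothetical new data: same column names, different order.
new_df = pd.DataFrame({"D": [0.7], "B": ["horse"], "A": [0.8], "C": [0.9]})

# Selecting by the saved list restores the training column order.
new_df_reindexed = new_df[df_columns]
print(list(new_df_reindexed.columns))
```

Note that this selection raises a KeyError if new_df lacks any of the saved columns, so it only fits the case where the column names really are the same.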
To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Use the drop_first parameter to drop one of the generated columns and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
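A short sketch of this step might look as follows. The frame train_df and its values are illustrative assumptions; train_one_hot_encode_col_list is simply the saved column list.

```python
import pandas as pd

# Hypothetical training data with one categorical column, B.
train_df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3],
    "B": ["dog", "cat", "horse"],
    "C": [0.4, 0.5, 0.6],
})

# Encode only column B. drop_first=True drops the first category
# in sorted order ('cat' here), which avoids the dummy variable trap.
train_encoded = pd.get_dummies(train_df, columns=["B"], drop_first=True)

# Save the encoded column list so later data sets can be aligned to it.
train_one_hot_encode_col_list = list(train_encoded.columns)
print(train_one_hot_encode_col_list)
```

The untouched columns come first and the generated dummy columns are appended, each named after the source column and the category it stands for.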
To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:
- First, use get_dummies() to one-hot encode the new data set.
- Use pandas reindex to bring the new data frame into the same structure as the one used in model training: df.reindex(columns=train_one_hot_encode_col_list). Pass columns on its own; combining it with axis="columns" raises a TypeError. This creates dummy variable columns for categorical values that occur in the training data set but not in the new data set. They are filled with NaN unless you pass fill_value=0.
- Finally, any columns in the new data set that are not present in the training data set are removed. reindex already drops them; once every training column exists, plain selection does the same: test_df_reindexed = test_df_onehotencode[train_one_hot_encode_col_list]
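The steps above can be sketched as follows. The frames and category values are invented for illustration, and dtype=float and fill_value=0 are choices made here so that the aligned frame is uniformly numeric; they are not part of the original answer.

```python
import pandas as pd

# Hypothetical training and test data. The test set contains an unseen
# category ('zebra') and lacks two training categories ('dog', 'horse').
train_df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": ["dog", "cat", "horse"]})
test_df = pd.DataFrame({"A": [0.4, 0.5], "B": ["cat", "zebra"]})

# Step 0 (training time): encode and save the column list.
train_encoded = pd.get_dummies(train_df, columns=["B"], dtype=float)
train_one_hot_encode_col_list = list(train_encoded.columns)

# Step 1: encode the new data set the same way.
test_df_onehotencode = pd.get_dummies(test_df, columns=["B"], dtype=float)

# Steps 2 and 3: reindex adds the missing training columns (filled with 0)
# and drops columns that the training data never had.
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0
)
print(test_df_reindexed)
```

After the reindex, the test frame has exactly the training columns in the training order, so its array form lines up column for column with the training array.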
If you follow these steps, you can rely entirely on the saved list of column names and will not need to track column positions or categorical value definitions.
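To illustrate why the names are enough, here is a small sketch under the assumption that column B was encoded with get_dummies() and without drop_first (with drop_first, an all-zero row stands for the dropped category, so the idxmax step below would not apply). The data is made up.

```python
import pandas as pd

# Hypothetical data with one categorical column, B.
df = pd.DataFrame({"A": [0.1, 0.2, 0.3],
                   "B": ["dog", "cat", "horse"],
                   "C": [0.4, 0.5, 0.6]})
encoded = pd.get_dummies(df, columns=["B"], dtype=float)

# The column name carries the meaning, so a position in the underlying
# array can be looked up by name instead of tracked by hand.
dog_position = encoded.columns.get_loc("B_dog")
encoded_array = encoded.to_numpy()

# Recover the original category per row: the dummy column holding the 1
# is the row's maximum, and stripping the 'B_' prefix gives the label.
dummy_cols = [c for c in encoded.columns if c.startswith("B_")]
decoded = encoded[dummy_cols].idxmax(axis=1).str[len("B_"):]
print(decoded.tolist())
```

The prefix-based filter assumes no other column name starts with "B_"; a different separator can be chosen with the prefix_sep parameter of get_dummies() if that is a concern.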
I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html