How to keep track of columns after encoding categorical variables?

Question

How can I keep track of the original columns of a dataset after I perform data preprocessing on it?

In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth.

However, once I encode the categorical column B, df_columns is no longer valid for keeping track of the columns of df_dummies.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

animal = ['dog', 'cat', 'horse']

df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for _ in range(9)],
                   'C': np.random.rand(9),
                   'D': np.random.rand(9)})

df_array = df.values      # plain NumPy array; the column names are lost here
df_columns = df.columns   # column 0 is A, column 1 is B, ...

# One-hot encode column 1 (B); the remaining columns are passed through unchanged
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
df_dummies = np.array(ct.fit_transform(df_array), dtype=float)

The solution should be agnostic of the position of the categorical column, be it A, B, C or D. I could do the grunt work and keep updating df_columns by hand, but that wouldn't be elegant or "pythonic".

Furthermore, how would the solution keep track of what the encoded categories mean? For example, {0,0,1} would be cat, {0,1,0} would be dog, and so on.

PS - I am aware of the dummy variable trap and will take df_dummies[:,1:] when I actually use it to train my model.

Recommended answer

Can you confirm whether future data sets will continue to have the same column names? If I understood your question correctly, all you need to do is save df_columns from the original data frame and use it to reindex your new data frame.

new_df_reindexed = new_df[df_columns]
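
As a minimal sketch of that line in context: the small frames below are made-up illustration data, and the only assumption is that the new data set carries the same column names as the original, possibly in a different order.

```python
import pandas as pd

# Hypothetical training frame; only its column names matter here
train_df = pd.DataFrame({"A": [0.1, 0.2], "B": ["dog", "cat"], "C": [1.0, 2.0]})
df_columns = list(train_df.columns)  # saved once: ["A", "B", "C"]

# Hypothetical new frame with the same columns in a different order
new_df = pd.DataFrame({"C": [3.0], "B": ["horse"], "A": [0.3]})

# Selecting by the saved names restores the original column order
new_df_reindexed = new_df[df_columns]
print(list(new_df_reindexed.columns))
```

Selecting by label like this raises a KeyError if a saved column is missing from the new frame, which is a useful early check.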

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Use the drop_first parameter to drop one of the generated columns per categorical variable and avoid the dummy variable trap. Also, save the column list of the one-hot-encoded data frame.
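
A short sketch of this step, using made-up data. The variable name train_one_hot_encode_col_list follows the answer; passing columns=["B"] and dtype=float are choices made for this illustration, not requirements.

```python
import pandas as pd

# Hypothetical training data with one categorical column, B
train_df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3],
    "B": ["dog", "cat", "horse"],
    "C": [1.0, 2.0, 3.0],
})

# Encode only B; drop_first=True drops the first category in sorted order
train_encoded = pd.get_dummies(train_df, columns=["B"], drop_first=True, dtype=float)

# Save the encoded column names for later use on new data
train_one_hot_encode_col_list = list(train_encoded.columns)
print(train_one_hot_encode_col_list)
```

The generated column names carry both the original column and the category value, so the saved list doubles as a record of what each encoded column means.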

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

  • First use get_dummies() to one-hot-encode the new data set.
  • Use pandas reindex to bring the new data frame into the same structure as the one used in model training - df.reindex(columns=train_one_hot_encode_col_list). Pass only columns=...; reindex raises a TypeError if axis is given as well. Adding fill_value=0 fills the newly created columns with zeros instead of NaN.
  • The above creates dummy variable columns for categorical values that occur in the training data set but are not present in the categorical columns of the new data set.
  • Finally, remove any columns in the new data set that are not present in the training data set. reindex(columns=...) already drops them; selecting by the saved list has the same effect once every training column exists - test_df_reindexed = test_df_onehotencode[train_one_hot_encode_col_list]
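
The steps above can be sketched as follows. The data is invented for illustration, and fill_value=0.0 is an assumption of this sketch (without it, reindex fills the new columns with NaN). drop_first is left out here on purpose: applied separately to each data set it may drop a different category in each, so in this sketch a reference column would be dropped only after aligning.

```python
import pandas as pd

# Hypothetical training and test data; the test set lacks "cat" and "horse"
# and contains a category ("zebra") never seen in training
train_df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": ["dog", "cat", "horse"]})
test_df = pd.DataFrame({"A": [0.4, 0.5], "B": ["dog", "zebra"]})

# Encode the training data and save its column list
train_encoded = pd.get_dummies(train_df, columns=["B"], dtype=float)
train_one_hot_encode_col_list = list(train_encoded.columns)

# Step 1: encode the new data set on its own
test_df_onehotencode = pd.get_dummies(test_df, columns=["B"], dtype=float)

# Steps 2-4: align to the training columns; missing columns are created
# and zero-filled, unseen columns such as B_zebra are dropped
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0.0
)
print(test_df_reindexed)
```

Note that a row with an unseen category ends up with zeros in every dummy column, which may or may not be the behaviour you want for your model.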

If you follow these steps, you can rely entirely on the saved lists of column names, and will not need to track column positions or categorical value definitions.
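
To illustrate why the column names are enough to recover what each encoded column means, here is a small sketch. The "=" separator (set through the prefix_sep parameter of get_dummies) and the origin dictionary are choices made for this example; they are not part of the answer above.

```python
import pandas as pd

# Hypothetical data with one categorical column, B
df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3],
    "B": ["dog", "cat", "horse"],
    "C": [1.0, 2.0, 3.0],
})
categorical_cols = ["B"]

# Use "=" as separator so encoded names read like "B=dog"
encoded = pd.get_dummies(df, columns=categorical_cols, prefix_sep="=", dtype=float)

# Map every encoded column back to (original column, category or None)
origin = {}
for col in encoded.columns:
    name, sep, category = col.partition("=")
    if sep and name in categorical_cols:
        origin[col] = (name, category)
    else:
        origin[col] = (col, None)

print(origin)
```

The mapping is independent of where the categorical column sits in the original frame, since it is derived from names rather than positions.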

I would also advise you to read the following for further reference:

  • One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
  • Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
