How to keep track of columns after encoding categorical variables?

Views: 65 Published: 2021/4/21 19:47:20 Tags: python machine-learning scikit-learn categorical-data one-hot-encoding


Question

How can I keep track of the original columns of a dataset once I have performed data preprocessing on it?

In the code below, df_columns tells me that column 0 in df_array is A, column 1 is B, and so forth.

However, once I encode the categorical column B, df_columns is no longer valid for keeping track of the columns in df_dummies.

import pandas as pd
import numpy as np

animal = ['dog','cat','horse']

df = pd.DataFrame({'A': np.random.rand(9),
                   'B': [animal[np.random.randint(3)] for i in range(9)],
                   'C': np.random.rand(9),
                   'D': np.random.rand(9)})

df_array = df.values
df_columns = df.columns

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('encoder', OneHotEncoder(), [1])], remainder='passthrough')
# np.float was removed in NumPy 1.24; the builtin float is the equivalent dtype
df_dummies = np.array(ct.fit_transform(df_array), dtype=float)

The solution should be agnostic of the position of the categorical column, be it A, B, C or D. I could do the grunt work and keep updating df_columns by hand, but that would not be elegant or "pythonic".

Furthermore, how would the solution keep track of what the encoded categories mean? For example, {0,0,1} would be cat, {0,1,0} would be dog, and so on.

PS: I am aware of the dummy variable trap and will take df_dummies[:,1:] when I actually use it to train my model.

Recommended answer

Can you confirm whether future data sets will continue to have the same column names? If I understood your question correctly, all you need to do is save df_columns from the original data frame and use it to reindex your new data frame.

new_df_reindexed = new_df[df_columns]
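
As a minimal sketch of that idea (the frames train_df and new_df and their values are made up for illustration, they are not from the question), selecting by the saved names restores the original column order no matter how the new frame arrives:

```python
import pandas as pd

# Hypothetical training frame; its column order is the reference.
train_df = pd.DataFrame({"A": [0.1, 0.2], "B": ["dog", "cat"], "C": [1.0, 2.0]})
df_columns = train_df.columns  # saved once, before any preprocessing

# Hypothetical new frame with the same columns in a different order.
new_df = pd.DataFrame({"C": [3.0], "B": ["horse"], "A": [0.3]})

# Selecting by the saved names puts the columns back in the training order.
new_df_reindexed = new_df[df_columns]
print(list(new_df_reindexed.columns))
```

Note that this selection raises a KeyError if new_df is missing one of the saved columns, so it only works when the column names really are stable.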

To answer your other questions, you can one-hot encode your data using get_dummies() from pandas. Use the drop_first parameter to drop one of the generated columns for each categorical variable and avoid the dummy variable trap. get_dummies() names each generated column after the source column and the category value (for example B_dog), so the column names themselves record what each indicator means. Also, save the column list of the one-hot-encoded data frame.
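
A small sketch of this step, assuming a toy training frame (train_df and its values are illustrative only). The saved list is named train_one_hot_encode_col_list to match the name used later in this answer:

```python
import pandas as pd

# Hypothetical training data with one categorical column, B.
train_df = pd.DataFrame({
    "A": [0.1, 0.2, 0.3],
    "B": ["dog", "cat", "horse"],
    "C": [1.0, 2.0, 3.0],
})

# Encode only B; drop_first=True drops the first category in sorted order (cat).
train_one_hot = pd.get_dummies(train_df, columns=["B"], drop_first=True)

# Save the resulting column names for later use on new data.
train_one_hot_encode_col_list = list(train_one_hot.columns)
print(train_one_hot_encode_col_list)
```

The untouched columns come first and the generated indicator columns follow, each labelled with its category, so no separate lookup table is needed to know which column stands for which animal.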

To ensure that your new / testing / holdout data set has the same column definition as the one used in model training:

First, use get_dummies() to one-hot-encode the new data set.
Use pandas reindex to bring the new data frame into the same structure as the one used in model training - df.reindex(columns=train_one_hot_encode_col_list, fill_value=0). Pass either columns= or axis="columns" with the labels, not both; pandas raises a TypeError if both are given.
The above creates dummy variable columns for categorical values that occur in the training data set but not in the categorical columns of the new data set. Without fill_value=0 these columns are filled with NaN.
Finally, select the training columns to remove any columns in the new data set that were not present in the training data set - test_df_reindexed = test_df_onehotencode[train_one_hot_encode_col_list]. Reindexing on columns already drops such columns, so on the reindexed frame this selection only makes the intent explicit.
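
The steps above can be sketched as follows. The frames and the unseen category zebra are made up for illustration. One choice here is my own assumption rather than part of the answer: the new data is encoded without drop_first, because drop_first would drop whichever category sorts first in the new data, which need not be the one dropped during training.

```python
import pandas as pd

# Hypothetical training and test data; "zebra" never appears in training.
train_df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": ["dog", "cat", "horse"]})
test_df = pd.DataFrame({"A": [0.4, 0.5], "B": ["dog", "zebra"]})

# Training side: encode and save the column list (A, B_dog, B_horse).
train_one_hot = pd.get_dummies(train_df, columns=["B"], drop_first=True, dtype=int)
train_one_hot_encode_col_list = list(train_one_hot.columns)

# Step 1: encode the new data (columns become A, B_dog, B_zebra).
test_df_onehotencode = pd.get_dummies(test_df, columns=["B"], dtype=int)

# Steps 2 to 4: reindex to the training columns. Missing columns (B_horse)
# are created and filled with 0; unseen columns (B_zebra) are dropped.
test_df_reindexed = test_df_onehotencode.reindex(
    columns=train_one_hot_encode_col_list, fill_value=0
)
print(test_df_reindexed)
```

A row whose category was never seen in training ends up with all indicator columns set to 0, which makes it indistinguishable from the dropped baseline category; whether that is acceptable depends on the model.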

If you follow these steps, you can rely entirely on the saved list of column names, and will not need to track column positions or categorical value definitions.

I would also advise you to read the following for further reference:
One-hot encoding in pandas - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
Column re-indexing - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
