Specify list of possible values for Pandas get_dummies

Question

Suppose I have a Pandas DataFrame like the one below, and I'm encoding categorical_1 for training in scikit-learn:

import pandas as pd

data = {'numeric_1': [12.1, 3.2, 5.5, 6.8, 9.9],
        'categorical_1': ['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
# One indicator column per value observed in categorical_1
dummy_values = pd.get_dummies(frame['categorical_1'])

The values in 'categorical_1' are A, B, or C, so I end up with 3 columns in dummy_values. However, categorical_1 can in reality take on the values A, B, C, D, or E, so no column is created for D or E.

In R I would specify levels when factoring that column. Is there a corresponding way to do this with Pandas, or would I need to handle it manually?

In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary, so I'm open to a different way to approach this.

Recommended answer

First, if you want pandas to take more values, simply add them to the list passed to get_dummies:

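import pandas as pd

data = {'numeric_1': [12.1, 3.2, 5.5, 6.8, 9.9],
        'categorical_1': ['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
# Appending 'D' and 'E' adds their columns, but also two extra rows to dummy_values
dummy_values = pd.get_dummies(data['categorical_1'] + ['D', 'E'])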

This works because in Python, + on lists is a concatenation, so

['A','B','C','B','B'] + ['D','E']

produces

['A', 'B', 'C', 'B', 'B', 'D', 'E']
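If the two extra rows are unwanted, a closer analogue of R's levels is pandas' categorical dtype. The following is only a sketch, assuming the full set of possible values is known in advance; the name possible_values is made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    'numeric_1': [12.1, 3.2, 5.5, 6.8, 9.9],
    'categorical_1': ['A', 'B', 'C', 'B', 'B'],
})

# Assumption: every value the column can ever take is known up front.
possible_values = ['A', 'B', 'C', 'D', 'E']

# Declare the categories explicitly, much like factor(..., levels=...) in R.
levels = pd.CategoricalDtype(categories=possible_values)
as_categorical = frame['categorical_1'].astype(levels)

# get_dummies emits one column per declared category, observed or not,
# and keeps the original number of rows.
dummy_values = pd.get_dummies(as_categorical)

print(list(dummy_values.columns))  # ['A', 'B', 'C', 'D', 'E']
print(dummy_values.shape)          # (5, 5)
```

The D and E columns are all zeros here, which is the same information the appended-list approach gives, just without the two artificial rows.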

"In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary, so I'm open to a different way to approach this."

From a machine learning perspective, this is largely redundant. The column is categorical, so the value 'D' means nothing to a model that has never seen it before. If you are encoding the feature in unary (one column per value, which I assume since that is what you create), it is enough to represent the 'D' and 'E' values with

A   B   C
0   0   0

(I assume that you represent the 'B' value with 0 1 0, 'C' with 0 0 1, etc.)

because if there were no such values in the training set, then at test time no model can distinguish between being given the value 'D' and being given 'Elephant'.
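One way to get that all-zeros row in practice is to align the test columns to the training columns. This is a minimal sketch rather than the only way to do it; the test values below are hypothetical and not part of the question.

```python
import pandas as pd

train = pd.Series(['A', 'B', 'C', 'B', 'B'])
test = pd.Series(['B', 'D', 'A'])  # hypothetical test data; 'D' never appears in training

train_dummies = pd.get_dummies(train)

# Keep exactly the training columns: the unseen 'D' column is dropped,
# and any training column missing from the test data ('C') is filled with 0.
test_dummies = (
    pd.get_dummies(test)
    .reindex(columns=train_dummies.columns, fill_value=0)
    .astype(int)
)

print(test_dummies.iloc[0].tolist())  # 'B' row: [0, 1, 0]
print(test_dummies.iloc[1].tolist())  # 'D' row: [0, 0, 0]
```

The model then sees the same three columns at training and test time, and any unseen value collapses to 0 0 0.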

The only reason to add these columns now would be that you expect to add data with 'D' values in the future and do not want to modify the code later. In that case it is reasonable to do it now, even though it could make training slightly more complex, since you add a dimension that for now carries no information at all. That seems a small problem.

If you are not going to encode it in the unary format, but rather want to use these values as a single feature with categorical values, then you do not need to create these "dummies" at all. Instead, use a model that can work with such values directly, such as Naive Bayes, which can be trained with "Laplacian smoothing" so that it can cope with values it has never seen.
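To make the smoothing idea concrete, here is a small sketch of additive (Laplace) smoothing for a single categorical feature. It is not a full Naive Bayes classifier: it only shows the per-value probability estimate that such a classifier would compute for one class. The function name smoothed_likelihoods and the alpha parameter are illustrative assumptions.

```python
from collections import Counter


def smoothed_likelihoods(observed, possible_values, alpha=1.0):
    """Estimate P(value) with additive (Laplace) smoothing.

    Every possible value gets alpha pseudo-counts, so values that never
    occur in the observed data still receive a nonzero probability.
    """
    counts = Counter(observed)
    total = len(observed) + alpha * len(possible_values)
    return {v: (counts[v] + alpha) / total for v in possible_values}


observed = ['A', 'B', 'C', 'B', 'B']  # values seen in training
probs = smoothed_likelihoods(observed, ['A', 'B', 'C', 'D', 'E'])

print(probs['B'])  # (3 + 1) / (5 + 5) = 0.4
print(probs['D'])  # (0 + 1) / (5 + 5) = 0.1
```

Without smoothing, 'D' would get probability 0 and wipe out the whole product of likelihoods; with it, an unseen value is merely unlikely.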
