Is pd.get_dummies one-hot encoding?

Question

Given the difference between one-hot encoding and dummy coding, is the pandas.get_dummies method one-hot encoding when using default parameters (i.e. drop_first=False)?

If so, does it make sense to remove the intercept from the logistic regression model? Here is an example:

# Assumes the dataset is already in a DataFrame X and the true labels are in y
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns (drop_first=False by default)
X = pd.get_dummies(X)
# Note: test_size=0.80 holds out 80% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80)

# Fit without an intercept term
clf = LogisticRegression(fit_intercept=False)
clf.fit(X_train, y_train)

Recommended answer

Dummies are any variables that are either one or zero for each observation. When pd.get_dummies is applied to a column of categories with one category per observation, it produces a new column (variable) for each unique categorical value and places a one in the column corresponding to the value present for that observation. This is equivalent to one-hot encoding.

One-hot encoding is characterized by having exactly one 1 per set of categorical values per observation.
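The "exactly one 1 per observation" property can be checked with row sums. The sketch below uses a made-up `colors` column (an assumption for illustration, not data from the question) to contrast the default output with `drop_first=True`, which is the dummy-coding variant the question refers to. `dtype=int` is passed only so the output is 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Hypothetical toy column, not taken from the original post.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Default (drop_first=False): one column per category, i.e. one-hot encoding.
one_hot = pd.get_dummies(colors, dtype=int)

# drop_first=True: the first category in sorted order ("blue") is dropped
# and becomes the all-zeros reference level, i.e. dummy coding.
dummy = pd.get_dummies(colors, drop_first=True, dtype=int)

print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1]: every row has exactly one 1
print(dummy.sum(axis=1).tolist())    # [1, 1, 0, 1]: the "blue" row is all zeros
```

With dummy coding, the dropped level is absorbed by the intercept, which is why the two encodings pair differently with `fit_intercept`.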

Consider the series s:

s = pd.Series(list('AABBCCABCDDEE'))

s

0     A
1     A
2     B
3     B
4     C
5     C
6     A
7     B
8     C
9     D
10    D
11    E
12    E
dtype: object

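pd.get_dummies will produce a one-hot encoding. And yes, it is entirely appropriate not to fit the intercept: the one-hot columns sum to 1 for every observation, so together they already span the constant term an intercept would add.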

pd.get_dummies(s)

    A  B  C  D  E
0   1  0  0  0  0
1   1  0  0  0  0
2   0  1  0  0  0
3   0  1  0  0  0
4   0  0  1  0  0
5   0  0  1  0  0
6   1  0  0  0  0
7   0  1  0  0  0
8   0  0  1  0  0
9   0  0  0  1  0
10  0  0  0  1  0
11  0  0  0  0  1
12  0  0  0  0  1
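One way to see why the intercept is redundant here is to look at the rank of the design matrix once a column of ones is added. This is a minimal sketch using NumPy (an addition for illustration; the original answer does not use it), with the same series `s` as above.

```python
import numpy as np
import pandas as pd

s = pd.Series(list('AABBCCABCDDEE'))
dummies = pd.get_dummies(s, dtype=int)  # dtype=int gives 0/1 instead of booleans

# Every observation has exactly one 1 across the five columns.
all_rows_sum_to_one = bool((dummies.sum(axis=1) == 1).all())

# Prepend an explicit intercept column of ones to the design matrix.
with_intercept = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# Six columns but only rank 5: the intercept column equals the sum
# of the five dummy columns, so it adds no new information.
rank = np.linalg.matrix_rank(with_intercept)
print(all_rows_sum_to_one, with_intercept.shape[1], rank)  # True 6 5
```

Dropping either the intercept or one dummy column restores a full-rank design matrix.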

However, suppose s contained different data, with several categories per observation, and you used pd.Series.str.get_dummies:

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))

s

0    A|B
1      A
2      B
3      B
4    C|D
5    D|B
6      A
7      B
8      C
9    A|D
dtype: object

Then get_dummies produces dummy variables that are not one-hot encoded, since an observation can have more than one 1, and you could theoretically keep the intercept.

s.str.get_dummies()

   A  B  C  D
0  1  1  0  0
1  1  0  0  0
2  0  1  0  0
3  0  1  0  0
4  0  0  1  1
5  0  1  0  1
6  1  0  0  0
7  0  1  0  0
8  0  0  1  0
9  1  0  0  1
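The same checks can be repeated on this multi-label series. In this sketch (again using NumPy as an illustrative addition), the row sums are no longer all 1, and a column of ones is not a linear combination of the dummy columns, so an intercept would carry information of its own.

```python
import numpy as np
import pandas as pd

s = pd.Series('A|B,A,B,B,C|D,D|B,A,B,C,A|D'.split(','))
multi = s.str.get_dummies()  # splits each string on '|' by default

# Some observations belong to two categories, so row sums vary.
row_sums = multi.sum(axis=1).tolist()
print(row_sums)  # [2, 1, 1, 1, 2, 2, 1, 1, 1, 2]

# Add an explicit intercept column and check the rank of the design matrix.
with_intercept = np.column_stack([np.ones(len(multi)), multi.to_numpy()])
rank = np.linalg.matrix_rank(with_intercept)
print(with_intercept.shape[1], rank)  # 5 5: full column rank
```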
