Using categorical data as features in sklearn LogisticRegression
Question
I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand, of course, that I need to encode it.

What I don't understand is how to pass the encoded feature to the logistic regression so that it is processed as a categorical feature, rather than having the integer value it got during encoding interpreted as a standard quantifiable feature.
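One common way to achieve this is to expand the categorical column into 0/1 indicator columns before fitting, so the model never sees an arbitrary integer code. A minimal sketch, assuming a pandas DataFrame (the column names and data here are made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "animal": ["mouse", "cat", "dog", "cat"],  # categorical feature
    "weight": [0.02, 4.0, 12.0, 5.0],          # numeric feature
})
y = [0, 1, 1, 1]

# get_dummies replaces 'animal' with one indicator column per category
# (animal_cat, animal_dog, animal_mouse), leaving numeric columns alone.
X = pd.get_dummies(df, columns=["animal"])

model = LogisticRegression().fit(X, y)
print(X.columns.tolist())
```

With this encoding each category gets its own coefficient, which is exactly the "processed as categorical" behavior asked about.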
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply. Especially with the first one!
You can create indicator variables for different categories. For example:
animal_names = {'mouse';'cat';'dog'}
Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')
Then we have:
Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
And you can concatenate these onto your original data matrix:
X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
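The same recipe in Python/NumPy (X here is a stand-in one-column matrix, since the original data matrix isn't shown):

```python
import numpy as np

animal_names = np.array(["mouse", "cat", "dog"])

# Boolean comparison plays the role of strcmp above; cast to int for 0/1.
indicator_cat = (animal_names == "cat").astype(int)  # [0 1 0]
indicator_dog = (animal_names == "dog").astype(int)  # [0 0 1]

X = np.ones((3, 1))  # stand-in for the original data matrix
X_with_indicator_vars = np.column_stack([X, indicator_cat, indicator_dog])
print(X_with_indicator_vars)
```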
Remember though to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't be full column rank (or in econometric terms, you have multicollinearity).
[1 1 0 0
 1 0 1 0
 1 0 0 1]

Notice how a constant term, an indicator for mouse, an indicator for cat, and an indicator for dog lead to a less-than-full-column-rank matrix: the first column is the sum of the last three.
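A quick numerical check of that rank argument (one extra "mouse" row added so the matrix has more rows than columns):

```python
import numpy as np

# Constant term + indicators for all three categories (mouse, cat, dog).
M = np.array([[1, 1, 0, 0],   # mouse
              [1, 0, 1, 0],   # cat
              [1, 0, 0, 1],   # dog
              [1, 1, 0, 0]])  # another mouse

# Column 0 equals columns 1+2+3, so the rank is 3, not 4.
print(np.linalg.matrix_rank(M))

# Dropping one indicator (here: dog) restores full column rank.
print(np.linalg.matrix_rank(M[:, [0, 1, 2]]))
```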