Using categorical data as features in sklearn LogisticRegression

Question

I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand of course I need to encode it.

  1. What I don't understand is how to pass the encoded feature to LogisticRegression so that it is treated as a categorical feature, rather than having the integer assigned during encoding interpreted as an ordinary quantitative value.

  2. (Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary_, and just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply.

Especially with the first one!

Solution

You can create an indicator (dummy) variable for each category. For example, in MATLAB syntax:

animal_names = {'mouse';'cat';'dog'}

Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')

Then we have:

                [0                         [0
Indicator_cat =  1        Indicator_dog =   0
                 0]                         1]

And you can concatenate these onto your original data matrix:

X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
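
Since the question is about scikit-learn, the same idea can be sketched in Python. This is a minimal illustration, not the original answer's code: the toy data, the age_years column, and the labels y are made up for the example. OneHotEncoder builds one indicator column per category, and the result is concatenated onto the numeric features before fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and one numeric column.
animal_names = np.array([["mouse"], ["cat"], ["dog"], ["cat"], ["dog"], ["mouse"]])
age_years = np.array([[1.0], [3.0], [5.0], [2.0], [7.0], [1.0]])
y = np.array([0, 0, 1, 0, 1, 0])  # made-up binary labels

# One indicator column per category; categories are sorted: cat, dog, mouse.
encoder = OneHotEncoder()
indicators = encoder.fit_transform(animal_names).toarray()

# Concatenate the indicator columns onto the numeric feature.
X_with_indicator_vars = np.hstack([age_years, indicators])

# The model now sees 0/1 columns instead of arbitrary integer codes.
model = LogisticRegression()
model.fit(X_with_indicator_vars, y)
print(encoder.categories_[0], X_with_indicator_vars.shape)
```

Because each category gets its own 0/1 column, the model fits a separate coefficient per category rather than assuming that the integer codes 0, 1, 2 lie on a meaningful scale.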


Remember, though, to leave one category without an indicator if the data matrix includes a constant term (intercept)! Otherwise the indicator columns sum to the constant column, so the data matrix won't have full column rank (in econometric terms, you have perfect multicollinearity).

[1  1  0  0         Notice how a constant term, an indicator for mouse,
 1  0  1  0         an indicator for cat, and an indicator for dog
 1  0  0  1]        lead to a matrix with less than full column rank:
                    the first column is the sum of the last three.
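
The rank argument can be checked numerically. The sketch below, in Python with NumPy, rebuilds the matrix above and compares it with a version in which the mouse indicator is dropped; the variable names are illustrative, and mouse is chosen as the baseline category arbitrarily.

```python
import numpy as np

# Constant column plus indicators for mouse, cat, and dog (the matrix above).
X_all = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
])

# Same matrix with the mouse indicator dropped, so mouse is the baseline.
X_dropped = X_all[:, [0, 2, 3]]

rank_all = np.linalg.matrix_rank(X_all)          # fewer than the 4 columns
rank_dropped = np.linalg.matrix_rank(X_dropped)  # equal to the 3 columns
print(rank_all, X_all.shape[1])
print(rank_dropped, X_dropped.shape[1])
```

In scikit-learn, OneHotEncoder(drop="first") drops one category per feature in the same way. Note that LogisticRegression fits an intercept and applies L2 regularization by default, so it will still fit with all indicators present, but dropping one keeps the coefficients interpretable relative to a baseline category.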
