Using categorical data as features in sklearn LogisticRegression
Question
I'm trying to understand how to use categorical data as features in sklearn.linear_model's LogisticRegression.

I understand, of course, that I need to encode it.

What I don't understand is how to pass the encoded feature to the logistic regression so that it is processed as a categorical feature, rather than having the integer value it got during encoding interpreted as a standard quantifiable feature.
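One common way to achieve this is to expand the categorical column into 0/1 indicator columns before fitting, so the model never sees an arbitrary integer code. A minimal sketch, assuming a pandas DataFrame (the column names and data here are made up):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "animal": ["mouse", "cat", "dog", "cat"],  # categorical feature
    "weight": [0.02, 4.0, 12.0, 5.0],          # numeric feature
})
y = [0, 1, 1, 1]

# get_dummies replaces 'animal' with one indicator column per category
# (animal_cat, animal_dog, animal_mouse), leaving numeric columns alone.
X = pd.get_dummies(df, columns=["animal"])

model = LogisticRegression().fit(X, y)
print(X.columns.tolist())
```

With this encoding each category gets its own coefficient, which is exactly the "processed as categorical" behavior asked about.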
(Less important) Can somebody explain the difference between using preprocessing.LabelEncoder(), DictVectorizer.vocabulary, or just encoding the categorical data yourself with a simple dict? Alex A.'s comment here touches on the subject, but not very deeply. Especially with the first one!
You can create indicator variables for different categories. For example:
animal_names = {'mouse';'cat';'dog'}
Indicator_cat = strcmp(animal_names,'cat')
Indicator_dog = strcmp(animal_names,'dog')
Then we have:
Indicator_cat = [0; 1; 0]
Indicator_dog = [0; 0; 1]
And you can concatenate these onto your original data matrix:
X_with_indicator_vars = [X, Indicator_cat, Indicator_dog]
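The same recipe in Python/NumPy (X here is a stand-in one-column matrix, since the original data matrix isn't shown):

```python
import numpy as np

animal_names = np.array(["mouse", "cat", "dog"])

# Boolean comparison plays the role of strcmp above; cast to int for 0/1.
indicator_cat = (animal_names == "cat").astype(int)  # [0 1 0]
indicator_dog = (animal_names == "dog").astype(int)  # [0 0 1]

X = np.ones((3, 1))  # stand-in for the original data matrix
X_with_indicator_vars = np.column_stack([X, indicator_cat, indicator_dog])
print(X_with_indicator_vars)
```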
Remember though to leave one category without an indicator if a constant term is included in the data matrix! Otherwise, your data matrix won't be full column rank (or in econometric terms, you have multicollinearity).
[1 1 0 0
 1 0 1 0
 1 0 0 1]

Notice how a constant term, an indicator for mouse, an indicator for cat, and an indicator for dog lead to a less-than-full-column-rank matrix: the first column is the sum of the last three.
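A quick numerical check of that rank argument (one extra "mouse" row added so the matrix has more rows than columns):

```python
import numpy as np

# Constant term + indicators for all three categories (mouse, cat, dog).
M = np.array([[1, 1, 0, 0],   # mouse
              [1, 0, 1, 0],   # cat
              [1, 0, 0, 1],   # dog
              [1, 1, 0, 0]])  # another mouse

# Column 0 equals columns 1+2+3, so the rank is 3, not 4.
print(np.linalg.matrix_rank(M))

# Dropping one indicator (here: dog) restores full column rank.
print(np.linalg.matrix_rank(M[:, [0, 1, 2]]))
```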