How can I use sklearn.naive_bayes with (multiple) categorical features?


Problem Description


I want to learn a Naive Bayes model for a problem where the class is boolean (takes on one of two values). Some of the features are boolean, but other features are categorical and can take on a small number of values (~5).

If all my features were boolean then I would want to use sklearn.naive_bayes.BernoulliNB. It seems clear that sklearn.naive_bayes.MultinomialNB is not what I want.

One solution is to split up my categorical features into boolean features. For instance, if a variable "X" takes on values "red", "green", "blue", I can have three variables: "X is red", "X is green", "X is blue". That violates the assumption of conditional independence of the variables given the class, so it seems totally inappropriate.
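For concreteness, a minimal sketch of that split using sklearn's OneHotEncoder (the data here is made up to match the "X" example above):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical feature "X" with values red/green/blue
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# Each category becomes its own boolean indicator column:
# "X is blue", "X is green", "X is red" (categories come out sorted)
encoder = OneHotEncoder()
X_dummies = encoder.fit_transform(X).toarray()
print(encoder.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]
print(X_dummies)            # exactly one 1 per row, 0 everywhere else
```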

Another possibility is to encode the variable as a real-valued variable where 0.0 means red, 1.0 means green, and 2.0 means blue. Using GaussianNB on that encoding also seems totally inappropriate, for obvious reasons (it imposes an arbitrary ordering and spacing on the categories).

What I'm trying to do doesn't seem weird, but I don't understand how to fit it into the Naive Bayes models that sklearn gives me. It's easy to code up myself, but I'd prefer to use sklearn if possible, for obvious reasons (mostly: to avoid bugs).

[Edit to explain why I don't think multinomial NB is what I want]:

My understanding is that in multinomial NB the feature vector consists of counts of how many times a token was observed in k iid samples.

My understanding is that this is a fit for document classification, where there is an underlying class for each document, and each word in the document is assumed to be drawn from a categorical distribution specific to that class. A document would have k tokens, the feature vector would have length equal to the vocabulary size, and the sum of the feature counts would be k.
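To make those counts concrete, a toy sketch (the two documents are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["red red green", "blue green green blue"]  # hypothetical documents
vec = CountVectorizer()
X = vec.fit_transform(docs).toarray()
print(vec.get_feature_names_out())  # ['blue' 'green' 'red'] -- the vocabulary
print(X)                            # [[0 1 2], [2 2 0]] -- per-document counts
print(X.sum(axis=1))                # [3 4] -- each row sums to k, the doc length
```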

In my case, I have a number of Bernoulli variables, plus a couple of categorical ones. But there is no concept of "counts" here.

Example: classes are people who like or don't like math. Predictors are college major (categorical) and whether they went to graduate school (boolean).

I don't think this fits multinomial since there are no counts here.

Solution

Some of the features are boolean, but other features are categorical and can take on a small number of values (~5).

This is an interesting question, but it is actually more than a single one:

  1. How to deal with a categorical feature in NB.
  2. How to deal with non-homogeneous features in NB (and, as I'll point out in the following, even two categorical features are non-homogeneous).
  3. How to do this in sklearn.


Consider first a single categorical feature. NB assumes/simplifies that the features are independent. Your idea of transforming this into several binary variables is exactly that of dummy variables. Clearly, these dummy variables are anything but independent. Your idea of then running a Bernoulli NB on the result implicitly assumes independence. While it is known that, in practice, NB does not necessarily break when faced with dependent variables, there is no reason to try to transform the problem into the worst configuration for NB, especially as multinomial NB is a very easy alternative.

Conversely, suppose that after transforming the single categorical variable into a multi-column dataset using the dummy variables, you use a multinomial NB. The theory for multinomial NB states:

With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial (p_1, ..., p_n), where p_i is the probability that event i occurs. A feature vector x = (x_1, ..., x_n) is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see the bag of words assumption).

So, here, each instance of your single categorical variable is a "length-1 paragraph", and the distribution is exactly multinomial. Specifically, each row has 1 in one position and 0 in all the rest because a length-1 paragraph must have exactly one word, and so those will be the frequencies.
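A minimal sketch of this view (the data is made up for illustration): each one-hot row is a length-1 "document", and after fitting, exp(feature_log_prob_) recovers one categorical distribution per class:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# One-hot rows of a single 3-valued categorical feature:
# each row is a "length-1 paragraph" -- exactly one 1, rest 0
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB().fit(X, y)
# Each row sums to 1: a per-class categorical distribution over the 3 values
print(np.exp(clf.feature_log_prob_))
```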

Note that, from the point of view of sklearn's multinomial NB, the fact that the dataset now has 5 columns does not imply an assumption of independence between those columns.


Now consider the case where you have a dataset consisting of several features:

  1. Categorical
  2. Bernoulli
  3. Normal

Under the very assumption that justifies using NB, these variables are independent. Consequently, you can do the following:

  1. Build an NB classifier for each categorical feature separately, using your dummy variables and a multinomial NB.
  2. Build a single NB classifier for all of the Bernoulli data at once; this works because sklearn's Bernoulli NB is simply a shortcut for several single-feature Bernoulli NBs.
  3. Same as 2 for all the normal features.

By the definition of independence, the probability of an instance is the product of the probabilities that these classifiers assign to its feature blocks.
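A sketch of the whole recipe, using the asker's example (major is the categorical feature, grad school the Bernoulli one; the data, the extra Gaussian block, and the combination code are illustrative assumptions, not a built-in sklearn feature). Each block's posterior is proportional to prior times likelihood, so the prior would be counted once per block; the sketch combines the blocks in log space and subtracts the duplicated prior:

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)            # class: likes math or not

# Hypothetical feature blocks (made-up data for illustration)
major = rng.integers(0, 5, size=n)        # categorical: college major, 5 values
X_cat = np.eye(5)[major]                  # its dummy/one-hot encoding
X_bool = rng.integers(0, 2, size=(n, 1))  # Bernoulli: went to grad school
X_norm = rng.normal(size=(n, 2))          # two extra normal features

# One NB classifier per homogeneous block (steps 1-3 above)
nb_cat = MultinomialNB().fit(X_cat, y)
nb_bool = BernoulliNB().fit(X_bool, y)
nb_norm = GaussianNB().fit(X_norm, y)

# The product of the three posteriors counts the class prior three times;
# subtract it twice in log space, then renormalize per sample.
log_post = (nb_cat.predict_log_proba(X_cat)
            + nb_bool.predict_log_proba(X_bool)
            + nb_norm.predict_log_proba(X_norm)
            - 2 * nb_cat.class_log_prior_)
log_post -= logsumexp(log_post, axis=1, keepdims=True)
y_pred = log_post.argmax(axis=1)
print(y_pred[:10])
```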
