SPARK ML, Naive Bayes classifier: high probability prediction for one class


Question

Hi, I am using Spark ML to optimise a Naive Bayes multi-class classifier.

I have about 300 categories and I am classifying text documents. The training set is fairly balanced, with about 300 training examples per category.

All looks good, and the classifier works with acceptable precision on unseen documents. But I notice that, when classifying a new document, the classifier very often assigns a high probability to one of the categories (the prediction probability is almost equal to 1), while the other categories receive very low probabilities (close to zero).

What are the possible reasons for this?

I would like to add that Spark ML exposes something called a "raw prediction", and when I look at it I see negative numbers of more or less comparable magnitude. So even the category with the high probability has a comparable raw prediction score, but I am finding it difficult to interpret these scores.

Answer

Let's start with a very informal description of the Naive Bayes classifier. If C is the set of all classes, d is a document, and the x_i are its features, Naive Bayes returns:

\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid d) = \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}

Since P(d) is the same for all classes, we can simplify this to:

\hat{c} = \operatorname*{argmax}_{c \in C} P(d \mid c)\, P(c)

where

P(d \mid c) = P(x_1, x_2, \ldots, x_n \mid c)

Since we assume that the features are conditionally independent (that is why it is naive), we can further simplify this (with Laplace correction to avoid zeros) to:

\hat{c} = \operatorname*{argmax}_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c), \qquad P(x_i \mid c) = \frac{\operatorname{count}(x_i, c) + 1}{\sum_{x} \operatorname{count}(x, c) + |V|}

The problem with this expression is that in any non-trivial case it is numerically equal to zero. To avoid that, we use the following property:

\log(a \cdot b) = \log a + \log b

and substitute it back into the initial expression:

\hat{c} = \operatorname*{argmax}_{c \in C} \left[ \log P(c) + \sum_{i=1}^{n} \log P(x_i \mid c) \right]
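The move to log space is not cosmetic. A minimal sketch (the feature count and probabilities below are hypothetical, chosen only to illustrate the underflow) shows the product form collapsing to zero while the log-sum form stays representable:

```python
import math

# Hypothetical per-feature likelihoods for one class: 2000 features,
# each with a small conditional probability P(x_i | c).
likelihoods = [0.01] * 2000
prior = 1.0 / 300  # uniform prior over ~300 classes

# Naive product form: 0.01**2000 is far below the smallest double,
# so the running product underflows to exactly 0.0.
product = prior
for p in likelihoods:
    product *= p
print(product)  # 0.0

# Log-space form: an ordinary negative number, no underflow.
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)
print(log_score)  # about -9216
```

The same underflow would hit every class, which is why the product form cannot even rank the classes, let alone produce probabilities.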

These are the values you get as the raw prediction. Since each element is negative (the logarithm of a value in (0, 1]), the whole expression is negative as well. As you discovered yourself, these values are further normalized (see https://github.com/apache/spark/blob/v1.6.1/mllib/src/main/scala/org/apache/spark/ml/classification/NaiveBayes.scala#L195) so that the maximum value is equal to one, and then divided by the sum of the normalized values.
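A rough sketch of that normalization, assuming the shift-exponentiate-divide scheme used in the linked Spark source (the function name and the raw scores below are illustrative, not Spark API):

```python
import math

def raw_to_probability(raw_scores):
    """Turn log-space raw prediction scores into a probability
    distribution: shift by the maximum (so the largest exponentiated
    value is 1), exponentiate, and divide by the sum."""
    max_log = max(raw_scores)
    exps = [math.exp(s - max_log) for s in raw_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw scores of comparable magnitude for three classes, as in the
# question: gaps of a few units in log space.
raw = [-1520.4, -1525.1, -1528.9]
probs = raw_to_probability(raw)
print(probs)  # the first class dominates despite similar raw scores
```

This is exactly the effect from the question: raw scores that differ by only a few log units produce one probability near 1, because a gap of k in log space becomes a factor of e^k after exponentiation.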

It is important to note that while the values you get are not strictly P(c|d), they preserve all the important properties: the order and the ratios are exactly the same (ignoring possible numerical issues). If no other class gets a prediction close to one, it means that, given the evidence, this is a very strong prediction. So it is actually something you want to see.
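To see why such saturated probabilities are the norm for text: per-feature log-likelihood differences between classes add up over every token in the document, and the odds ratio grows exponentially in that sum. A hypothetical back-of-the-envelope illustration (the document length and per-feature edge are made-up numbers):

```python
import math

n_features = 300              # tokens in the document (hypothetical)
advantage_per_feature = 0.05  # tiny average edge in log P(x_i | c)

# Total log-score gap between the best and second-best class.
gap = n_features * advantage_per_feature  # 15.0

# Odds ratio between the two classes, and the winner's normalized
# probability in a two-class view of the normalization.
ratio = math.exp(gap)
p_best = ratio / (ratio + 1.0)
print(ratio)   # ~3.27 million
print(p_best)  # ~0.9999997
```

So even a barely-better-fitting class ends up with probability indistinguishable from 1 once a few hundred features have voted, which matches what you observe.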

