Using a Naive Bayes Classifier to classify tweets: some problems

Question

Using, amongst other sources, various posts here on Stack Overflow, I'm trying to implement my own PHP classifier to classify tweets into a positive, a neutral and a negative class. Before coding, I need to get the process straight. My train of thought and an example are as follows:

                                  p(class) * p(words|class)
 Bayes theorem: p(class|words) =  ------------------------- with
                                           p(words)

 assumption that p(words) is the same for every class leads to calculating
 arg max p(class) * p(words|class) with
 p(words|class) = p(word1|class) * p(word2|class) * ... and
 p(class) = #words in class / #words in total and

                 p(word, class)                       1
 p(word|class) = -------------- = p(word, class) * -------- =
                    p(class)                       p(class)

 #times word occurs in class    #words in total  #times word occurs in class
 --------------------------- * --------------- = ---------------------------
       #words in total          #words in class        #words in class

 Example: 

 ------+----------------+-----------------+
 class | words          | #words in class |
 ------+----------------+-----------------+
 pos   | happy win nice | 3               |
 neu   | neutral middle | 2               |
 neg   | sad loose bad  | 3               |
 ------+----------------+-----------------+

 p(pos) = 3/8
 p(neu) = 2/8
 p(neg) = 3/8

 Calculate: argmax p(class) * p(sad loose|class)

 p(sad loose|pos) = p(sad|pos) * p(loose|pos) = (0+1)/3 * (0+1)/3 = 1/9
 p(sad loose|neu) = p(sad|neu) * p(loose|neu) = (0+1)/2 * (0+1)/2 = 1/4
 p(sad loose|neg) = p(sad|neg) * p(loose|neg) =     1/3 *     1/3 = 1/9

 p(pos) * p(sad loose|pos) = 3/8 * 1/9 = 0.0416666667
 p(neu) * p(sad loose|neu) = 2/8 * 1/4 = 0.0625
 p(neg) * p(sad loose|neg) = 3/8 * 1/9 = 0.0416666667 <-- should be 100% neg!

As you can see, I have "trained" the classifier with a positive ("happy win nice"), a neutral ("neutral middle") and a negative ("sad loose bad") tweet. In order to prevent the problem of zero probabilities when a word is missing from a class, I'm using Laplace (or "add one") smoothing, see "(0+1)".
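In PHP, the calculation above could look roughly like the following minimal sketch (all names are illustrative, not actual project code); it reproduces the numbers from the example, including the pos/neg tie:

 <?php
 // Minimal sketch of the blueprint above; all names are illustrative.
 // Training data: one tokenized tweet per class, as in the example.
 $training = [
     'pos' => ['happy', 'win', 'nice'],
     'neu' => ['neutral', 'middle'],
     'neg' => ['sad', 'loose', 'bad'],
 ];

 $totalWords = array_sum(array_map('count', $training)); // 8

 function scoreClass(array $classWords, int $totalWords, array $tweet): float
 {
     $prior = count($classWords) / $totalWords;          // p(class)
     $likelihood = 1.0;
     foreach ($tweet as $word) {
         $count = count(array_keys($classWords, $word)); // #times word occurs in class
         // "(0+1)" smoothing: add one only when the count is zero
         $likelihood *= max($count, 1) / count($classWords);
     }
     return $prior * $likelihood;                        // p(class) * p(words|class)
 }

 foreach ($training as $class => $words) {
     printf("%s: %.10f\n", $class, scoreClass($words, $totalWords, ['sad', 'loose']));
 }
 // pos: 0.0416666667  neu: 0.0625000000  neg: 0.0416666667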

I basically have two questions:

  1. Is this the correct blueprint for the implementation? Is there room for improvement?
  2. When classifying a tweet ("sad loose") that contains only negative words, it should come out 100% in the "neg" class. Laplace smoothing, however, makes things more complicated: pos and neg end up with the same score (and with these counts neu even scores highest). Is there a way around this?

Answer

There are two main elements to improve in your reasoning.

First, you should improve your smoothing method:

  • When applying Laplace smoothing, it should be applied to all measurements, not just to those with zero denominator.
  • In addition, Laplace smoothing for such cases is usually given by (c+1)/(N+V), where c is the word's count in the class, N is the total number of words in the class, and V is the vocabulary size (e.g., see the Wikipedia article on additive smoothing).

Therefore, using the probability function you have defined (which might not be the most suitable one, see below):

p(sad loose|pos) = (0+1)/(3+8) * (0+1)/(3+8) = 1/121

p(sad loose|neu) = (0+1)/(2+8) * (0+1)/(2+8) = 1/100

p(sad loose|neg) = (1+1)/(3+8) * (1+1)/(3+8) = 4/121 <-- would become argmax
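For illustration, here is a minimal PHP sketch of this corrected smoothing (the function name is mine, not from any library):

 <?php
 // Full Laplace smoothing: p(word|class) = (c + 1) / (N + V),
 // applied to every word, not only to words with zero count.
 function smoothedWordProb(string $word, array $classWords, int $vocabSize): float
 {
     $c = count(array_keys($classWords, $word)); // #times word occurs in class
     $n = count($classWords);                    // #words in class (N)
     return ($c + 1) / ($n + $vocabSize);
 }

 $vocabSize = 8; // distinct words over all classes in the toy training set
 $neg = ['sad', 'loose', 'bad'];

 // p(sad loose|neg) = (1+1)/(3+8) * (1+1)/(3+8) = 4/121
 echo smoothedWordProb('sad', $neg, $vocabSize)
    * smoothedWordProb('loose', $neg, $vocabSize); // 0.0330578512...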

In addition, a more common way of calculating the probability in the first place would be:

(number of tweets in class containing term c) / (total number of tweets in class)

For instance, in the limited training set given above, and disregarding smoothing, p(sad|pos) = 0/1 = 0 and p(sad|neg) = 1/1 = 1. As the training set grows, the numbers become more meaningful. For example, if you had 10 tweets for the negative class, with 'sad' appearing in 4 of them, then p(sad|neg) would be 4/10.
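A minimal PHP sketch of this counting scheme (names are again illustrative):

 <?php
 // p(word|class) = #tweets in class containing the word / #tweets in class
 function docFrequency(string $word, array $classTweets): float
 {
     $containing = 0;
     foreach ($classTweets as $tokens) {         // each tweet is an array of tokens
         if (in_array($word, $tokens, true)) {
             $containing++;
         }
     }
     return $containing / count($classTweets);
 }

 // E.g., 'sad' appears in 3 of these 4 negative tweets:
 $negTweets = [['sad', 'bad'], ['sad'], ['bad', 'loose'], ['sad', 'loose']];
 echo docFrequency('sad', $negTweets); // 0.75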

Regarding the actual numbers output by the Naive Bayes algorithm: you shouldn't expect the algorithm to assign an actual probability to each class; rather, the order of the classes is what matters. Concretely, taking the argmax gives you the algorithm's best guess for the class, but not the probability of it. Assigning probabilities to NB results is another story; see, for example, an article discussing this issue.
