Combining individual probabilities in Naive Bayesian spam filtering

Question

I'm currently trying to generate a spam filter by analyzing a corpus I've amassed.

I'm using the Wikipedia entry http://en.wikipedia.org/wiki/Bayesian_spam_filtering to develop my classification code.

I've implemented code to calculate the probability that a message is spam given that it contains a specific word, using the following formula from the wiki:
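
\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}

where S is the event "the message is spam", H is "the message is ham", and W is "the message contains the given word" (this is the per-word rule the pSpaminess() method below implements).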

My PHP code:

public function pSpaminess($word)
{
    // Priors: P(spam) and P(ham) over the whole corpus
    $ps  = $this->pContentIsSpam();
    $ph  = $this->pContentIsHam();
    // Likelihoods: P(word | spam) and P(word | ham)
    $pws = $this->pWordInSpam($word);
    $pwh = $this->pWordInHam($word);
    // Bayes' rule: P(spam | word)
    $psw = ($pws * $ps) / ($pws * $ps + $pwh * $ph);
    return $psw;
}
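
The four helper methods aren't shown in the post. As a rough sketch only (the property names, the message counts, and the add-one smoothing below are my assumptions, not part of the original code), they could be backed by simple corpus counts:

// Hypothetical sketch -- the original post does not include these helpers.
// Prior: fraction of corpus messages labeled spam.
public function pContentIsSpam()
{
    return $this->spamMessageCount / ($this->spamMessageCount + $this->hamMessageCount);
}

// P(word | spam): relative frequency of the word in spam messages,
// with add-one (Laplace) smoothing so unseen words never yield 0.
public function pWordInSpam($word)
{
    $count = isset($this->spamWordCount[$word]) ? $this->spamWordCount[$word] : 0;
    return ($count + 1) / ($this->totalSpamWordCount + $this->vocabularySize);
}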

In accordance with the Combining individual probabilities section, I've implemented code to combine the probabilities of all the unique words in a test message to determine spaminess.

The formula from the wiki:
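
p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}

where p_1, ..., p_N are the individual per-word spam probabilities (this is the combining rule the predict() method below implements).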

My PHP code:

public function predict($content)
{
    $words = $this->tokenize($content);
    // Running product of p_i and of (1 - p_i) over all words
    $pProducts = 1;
    $pSums = 1;
    foreach ($words as $word)
    {
        $p = $this->pSpaminess($word);
        echo "$word: $p\n";
        $pProducts *= $p;
        $pSums *= (1 - $p);
    }
    // Combined spaminess: prod(p) / (prod(p) + prod(1 - p))
    return $pProducts / ($pProducts + $pSums);
}

On the test string "This isn't very bad at all.", the following output is produced:

C:\projects\bayes>php test.php
this: 0.19907407407407
isn't: 0.23
very: 0.2
bad: 0.2906976744186
at: 0.17427385892116
all: 0.16098484848485
probability message is spam: float(0.00030795502523944)

Here's my question: am I combining the individual probabilities correctly? Assuming I'm generating valid individual word probabilities, is the combination method correct?

My concern is the really small resulting probability of the calculation. I've tested it on a larger test message and ended up with a probability in scientific notation with more than 10 leading zeros. I was expecting values in the tenths or hundredths place.

I'm hoping the problem lies in my PHP implementation, but when I examine the combination function from Wikipedia, the formula's numerator is a product of fractions. I don't see how a combination of multiple probabilities could end up above even a 0.1% probability.

If it is the case that the longer the message, the lower the probability score will be, how do I adjust the spaminess threshold to correctly predict spam/ham for both small and large test cases?
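
As an aside, a minimal sketch of one way to sidestep the vanishingly small numbers (the function name combineLog and the 1e-9 clamp below are my own choices, not from the original code) is to evaluate the same combining formula in log space:

// Same combining rule as predict(), but computed with sums of logs
// so that very long messages don't underflow toward zero.
function combineLog(array $wordProbabilities)
{
    $logP = 0.0; // accumulates log(p_i)
    $logQ = 0.0; // accumulates log(1 - p_i)
    foreach ($wordProbabilities as $p) {
        // Clamp away from 0 and 1 so log() stays finite for words
        // seen only in spam or only in ham.
        $p = min(max($p, 1e-9), 1.0 - 1e-9);
        $logP += log($p);
        $logQ += log(1.0 - $p);
    }
    // prod(p) / (prod(p) + prod(1 - p))  ==  1 / (1 + exp(logQ - logP))
    return 1.0 / (1.0 + exp($logQ - $logP));
}

This changes nothing about which messages score higher; it only keeps the arithmetic stable. The combining formula is designed to push scores toward 0 or 1 rather than hover near 0.5, so very small values on ham-like text are expected.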

Additional information

My corpus is actually a collection of about 40k reddit comments, and I'm applying my "spam filter" against these comments. I rate an individual comment as spam or ham based on its down votes versus up votes: if a comment has fewer up votes than down votes it is considered spam, otherwise ham.

Now, because of the corpus type, it turns out there are actually few words that are used more in spam than in ham. For example, here is a top-ten list of words that appear more often in spam than in ham:

+-----------+------------+-----------+
| word      | spam_count | ham_count |
+-----------+------------+-----------+
| krugman   |         30 |        27 |
| fetus     |       12.5 |       7.5 |
| boehner   |         12 |        10 |
| hatred    |       11.5 |       5.5 |
| scum      |         11 |        10 |
| reserve   |         11 |        10 |
| incapable |        8.5 |       6.5 |
| socalled  |        8.5 |       5.5 |
| jones     |        8.5 |       7.5 |
| orgasms   |        8.5 |       7.5 |
+-----------+------------+-----------+

On the contrary, most words are used in far greater abundance in ham than in spam. Take, for instance, my top-10 list of words with the highest spam counts:

+------+------------+-----------+
| word | spam_count | ham_count |
+------+------------+-----------+
| the  |       4884 |     17982 |
| to   |     4006.5 |   14658.5 |
| a    |     3770.5 |   14057.5 |
| of   |     3250.5 |   12102.5 |
| and  |       3130 |     11709 |
| is   |     3102.5 |   11032.5 |
| i    |     2987.5 |   10565.5 |
| that |     2953.5 |   10725.5 |
| it   |       2633 |      9639 |
| in   |     2593.5 |    9780.5 |
+------+------------+-----------+

As you can see, the frequency of spam usage is significantly less than ham usage. In my corpus of 40k comments, 2100 comments are considered spam.

As suggested below, a test phrase from a post considered spam rates as follows:

Phrase

Cops are losers in general. That's why they're cops.

Analysis:

C:\projects\bayes>php test.php
cops: 0.15833333333333
are: 0.2218958611482
losers: 0.44444444444444
in: 0.20959269435914
general: 0.19565217391304
that's: 0.22080730418068
why: 0.24539170506912
they're: 0.19264544456641
float(6.0865969793861E-5)

According to this, there is an extremely low probability that this is spam. However, if I were to now analyze a ham comment:

Phrase

Bill and TED's excellent venture?

Analysis

C:\projects\bayes>php test.php
bill: 0.19534050179211
and: 0.21093065570456
ted's: 1
excellent: 0.16091954022989
venture: 0.30434782608696
float(1)

Okay, this is interesting. I'm doing these examples as I'm composing this update, so this is the first time I've seen the result for this specific test case. I think my prediction is inverted: it's actually picking out the probability of ham instead of spam. This deserves validation.

A new test against known ham.

Phrase

Complain about $174,000 salary being too little for self.  Complain about $50,000 a year too much for teachers.
Scumbag congressman.

Analysis

C:\projects\bayes>php test.php
complain: 0.19736842105263
about: 0.21896031561847
174: 0.044117647058824
000: 0.19665809768638
salary: 0.20786516853933
being: 0.22011494252874
too: 0.21003236245955
little: 0.21134020618557
for: 0.20980452359022
self: 0.21052631578947
50: 0.19245283018868
a: 0.21149315683195
year: 0.21035386631717
much: 0.20139771283355
teachers: 0.21969696969697
scumbag: 0.22727272727273
congressman: 0.27678571428571
float(3.9604152477223E-11)

Unfortunately, no. It turns out it was a coincidental result. I'm starting to wonder whether comments can't be so easily quantified. Perhaps the nature of a bad comment is too vastly different from the nature of a spam message.

Perhaps spam filtering only works when you have a specific word class of spam messages?

Final update

As pointed out in the replies, the weird results were due to the nature of the corpus. With a comment corpus where there is no explicit definition of spam, Bayesian classification does not perform well. Since it is possible (and likely) that any one comment may receive both spam and ham ratings from various users, it is not possible to generate a hard classification for spam comments.

Ultimately, I wanted to generate a comment classifier that could determine whether a comment post would garner karma, based on a Bayesian classification tuned to comment content. I may still investigate tuning the classifier to email spam messages and see if such a classifier can guess at karma response for comment systems. But for now, the question is answered. Thank you all for your input.

Answer

Verifying with just a calculator, it seems OK for the non-spam phrase you posted. In that case your $pProducts is a couple of orders of magnitude smaller than $pSums.
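
A quick worked check against the first output above (my own arithmetic, values rounded) shows the printed number is exactly what the formula should produce:

\prod p_i \approx 0.199 \times 0.23 \times 0.2 \times 0.291 \times 0.174 \times 0.161 \approx 7.5 \times 10^{-5}
\prod (1 - p_i) \approx 0.801 \times 0.77 \times 0.8 \times 0.709 \times 0.826 \times 0.839 \approx 0.24

so \prod p_i / (\prod p_i + \prod (1 - p_i)) \approx 3.1 \times 10^{-4}, which matches the printed 0.00030795. The small value is the formula behaving as designed on ham-like text, not an implementation error.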

Try running some real spam from your spam folder, where you'd see probabilities like 0.8. And guess why spammers sometimes try to send a piece of newspaper in a hidden frame along with the message :)
