Using document length in the Naive Bayes Classifier of NLTK Python

Question

I am building a spam filter using the NLTK in Python. I currently check for the occurrences of words and use the NaiveBayesClassifier, which gives an accuracy of .98 and an F-measure of .92 for spam and 0.98 for non-spam. However, when checking the documents on which my program errs, I notice that a lot of the spam that is classified as non-spam consists of very short messages.

So I want to add the length of a document as a feature for the NaiveBayesClassifier. The problem is that it currently only handles binary values. Is there any other way to do this than, for example, saying: length < 100 = true/false?

(p.s. I have built the spam detector analogously to the http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html example.)
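
For illustration, the threshold idea in the question could also be encoded as a bucketed nominal feature rather than a single cutoff — a hedged sketch, where the cutoffs 50 and 200 are arbitrary illustrative choices and NLTK featuresets map feature names to nominal values:

def length_features(words):
    # Discretize document length into a nominal bucket; the cutoffs
    # 50 and 200 are arbitrary, illustrative choices.
    n = len(words)
    if n < 50:
        return {'length_bucket': 'short'}
    elif n < 200:
        return {'length_bucket': 'medium'}
    return {'length_bucket': 'long'}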

Answer

NLTK's implementation of Naive Bayes doesn't do that, but you could combine NaiveBayesClassifier's predictions with a distribution over document lengths. NLTK's prob_classify method will give you a conditional probability distribution over classes given the words in the document, i.e., P(cl|doc). What you want is P(cl|doc,len) -- the probability of a class given the words in the document and its length. If we make a few more independence assumptions, we get:

P(cl|doc,len) = (P(doc,len|cl) * P(cl)) / P(doc,len)
              = (P(doc|cl) * P(len|cl) * P(cl)) / (P(doc) * P(len))
              = ((P(doc|cl) * P(cl)) / P(doc)) * (P(len|cl) / P(len))
              = P(cl|doc) * P(len|cl) / P(len)

(The second step assumes that the words in a document and its length are independent, both within each class and overall.)

You've already got the first term from prob_classify, so all that's left to do is to estimate P(len|cl) and P(len).
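
For reference, the first term is exactly what prob_classify returns once a classifier is trained. A minimal runnable sketch with a toy training set (the feature name and labels are illustrative, not from the original spam filter):

import nltk

# Toy training data; in practice these would be your spam/ham featuresets.
train = [({'contains_free': True}, 'spam'),
         ({'contains_free': False}, 'ham')]
classifier = nltk.NaiveBayesClassifier.train(train)

# prob_classify returns a probability distribution over classes, i.e. P(cl|doc).
dist = classifier.prob_classify({'contains_free': True})
print(dist.prob('spam'))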

You can get as fancy as you want when it comes to modeling document lengths, but to get started you can just assume that the logs of the document lengths are normally distributed. If you know the mean and the standard deviation of the log document lengths in each class and overall, it's then easy to calculate P(len|cl) and P(len).

Here's one way of going about estimating P(len):

from math import log
from nltk.corpus import movie_reviews
from scipy import stats
import numpy as np

# Log-length of every document in the corpus
loglens = [log(len(movie_reviews.words(f))) for f in movie_reviews.fileids()]

# Fit a normal distribution to the log-lengths; this models P(len)
mu = np.mean(loglens)
sd = np.std(loglens)
p = stats.norm(mu, sd)

The only tricky things to remember are that this is a distribution over log-lengths rather than lengths and that it's a continuous distribution. So, the probability of a document of length L will be:

p.cdf(log(L+1)) - p.cdf(log(L))

The conditional length distributions can be estimated in the same way, using the log-lengths of the documents in each class. That should give you what you need for P(cl|doc,len).
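
Putting the pieces together — a minimal sketch that reuses log, stats, np, and the overall distribution p from the snippet above; spam_loglens and ham_loglens are hypothetical per-class lists of log-lengths you would collect from your own labeled corpus:

def fit_loglen_dist(loglens):
    # Fit a normal distribution to a list of log document lengths.
    return stats.norm(np.mean(loglens), np.std(loglens))

def len_prob(dist, L):
    # Probability a continuous log-length distribution assigns to an
    # integer length L, as derived above.
    return dist.cdf(log(L + 1)) - dist.cdf(log(L))

def rescore(prob_dist, L, class_len_dists, overall_len_dist):
    # P(cl|doc,len) is proportional to P(cl|doc) * P(len|cl) / P(len);
    # renormalize so the adjusted scores sum to one.
    scores = {cl: prob_dist.prob(cl) * len_prob(class_len_dists[cl], L)
                  / len_prob(overall_len_dist, L)
              for cl in class_len_dists}
    total = sum(scores.values())
    return {cl: s / total for cl, s in scores.items()}

For example, rescore(classifier.prob_classify(feats), len(words), {'spam': fit_loglen_dist(spam_loglens), 'ham': fit_loglen_dist(ham_loglens)}, p) would yield the adjusted P(cl|doc,len) for each class.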
