Python NLTK Naive Bayes Classifier: What is the underlying computation that this classifier uses to classify input?


Problem description

I use the Naive Bayes classifier in Python NLTK to compute the probability distribution for the following example:

import nltk

def main():
    train = [(dict(feature=1), 'class_x'), (dict(feature=0), 'class_x'),
             (dict(feature=0), 'class_y'), (dict(feature=0), 'class_y')]

    test = [dict(feature=1)]

    classifier = nltk.classify.NaiveBayesClassifier.train(train)

    print("classes available: ", sorted(classifier.labels()))

    print("input assigned to: ", classifier.classify_many(test))

    for pdist in classifier.prob_classify_many(test):
        print("probability distribution: ")
        print('%.4f %.4f' % (pdist.prob('class_x'), pdist.prob('class_y')))

if __name__ == '__main__':
    main()

There are two classes (class_x and class_y) in the training dataset, with two training instances each. For class_x, the first instance has feature=1 and the second feature=0; for class_y, both instances have feature=0. The test dataset consists of a single input with feature=1.

When I run the code, the output is:

classes available:  ['class_x', 'class_y']
input assigned to:  ['class_x']
0.7500 0.2500

To get the probability for each class, the classifier should multiply the prior of the class (in this case, 0.5 for both) by the conditional probability of each feature given the class, with some form of smoothing applied.

I usually use a formula similar to this (or a similar variant):

P(class|features) ∝ prior of class × P(feature|class), where P(feature|class) = (frequency of feature in class + 1) / (total features in class + vocabulary size). The smoothing scheme can vary and slightly changes the outcome.
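Taken literally with add-1 (Laplace) smoothing, that formula can be applied by hand to the four training instances above. A minimal sketch in plain Python (no NLTK; the helper name `laplace` and the variable names are mine):

```python
# Counts from the four training instances:
#   class_x: feature=1 once, feature=0 once   (2 instances)
#   class_y: feature=0 twice                  (2 instances)
V = 2  # the feature takes 2 distinct values: {0, 1}

def laplace(count, n, v):
    """P(feature value | class) with add-1 (Laplace) smoothing."""
    return (count + 1) / (n + v)

prior = 0.5  # each class covers 2 of the 4 training instances

# Unnormalized scores for the test input (feature=1):
p_x = prior * laplace(1, 2, V)  # feature=1 seen once in class_x
p_y = prior * laplace(0, 2, V)  # feature=1 never seen in class_y

# Normalize into a probability distribution:
total = p_x + p_y
print('%.4f %.4f' % (p_x / total, p_y / total))  # 0.6667 0.3333
```

Note that add-1 smoothing yields 0.6667/0.3333 here, not the 0.7500/0.2500 that the NLTK classifier prints, which already suggests NLTK is using a different smoothing constant.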

In the example code above, how exactly does the classifier compute the probability distribution? What is the formula used?

I checked here and here, but could not get any information as to exactly how the computation is done.

Thanks in advance.

Solution

From the source code

https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L9

|                       P(label) * P(features|label)
|  P(label|features) = ------------------------------
|                              P(features)
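In that same file, the default `estimator` argument of `NaiveBayesClassifier.train` is `ELEProbDist`, i.e. Lidstone smoothing with gamma = 0.5 ("expected likelihood estimation": add 0.5 rather than 1 to each count). Redoing the hand computation with that estimator reproduces the printed distribution exactly. A minimal sketch in plain Python (the helper name `ele` and the variable names are mine):

```python
# ELE smoothing: add gamma = 0.5 to each count, with "bins" equal to
# the number of distinct values the feature takes in the data ({0, 1}).
gamma, bins = 0.5, 2

def ele(count, n):
    """P(feature value | class) with add-0.5 (ELE) smoothing."""
    return (count + gamma) / (n + gamma * bins)

prior = 0.5  # each class covers 2 of the 4 training instances

# Unnormalized scores for the test input (feature=1):
p_x = prior * ele(1, 2)  # feature=1 occurs once in class_x's 2 instances
p_y = prior * ele(0, 2)  # feature=1 never occurs in class_y

# Normalize into a probability distribution:
total = p_x + p_y
print('%.4f %.4f' % (p_x / total, p_y / total))  # 0.7500 0.2500
```

The unnormalized score 0.5 × 0.5 = 0.25 for class_x and 0.5 × 1/6 ≈ 0.083 for class_y normalize to exactly 0.75 and 0.25, matching the classifier's output; passing a different `estimator` to `train` changes these numbers.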
