Naive Bayes for Topic Detection Using a "Bag of Words" Approach

Question

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a naive Bayesian approach that I could look up for this?

Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of hard-coded words that map to topics. Depending on the occurrences of words other than the ones already mapped, I want to add those words to the mappings, thereby learning new words that map to a topic, and also update the probabilities of the words.

How should I go about doing this? Is my approach the right one?

Which programming language would be best suited for the implementation?

Recommended answer

Existing implementations of Naive Bayes

You would probably be better off just using one of the existing packages that support document classification using naive Bayes, e.g.:

Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.

Ruby - If Ruby is more your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not funny.

Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.

C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.

Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
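To give a feel for what these packages do internally, here is a minimal sketch of a multinomial naive Bayes classifier over bag-of-words counts, written in plain Python with no third-party libraries. The class name, the whitespace tokenizer, the add-one smoothing, and the tiny training set are all assumptions made for illustration; none of them is taken from the packages listed above, which will differ in detail.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTopicClassifier:
    """Multinomial naive Bayes over bag-of-words counts (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # add-alpha (Laplace) smoothing
        self.doc_counts = Counter()              # topic -> number of training docs
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.vocab = set()                       # every word seen in training

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase + whitespace split is enough for the demo.
        return text.lower().split()

    def train(self, text, topic):
        self.doc_counts[topic] += 1
        for word in self.tokenize(text):
            self.word_counts[topic][word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        """Return log P(topic) + sum of log P(word | topic) for each topic."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        words = [w for w in self.tokenize(text) if w in self.vocab]  # skip unseen words
        scores = {}
        for topic, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[topic].values())
            score = math.log(n_docs / total_docs)
            for word in words:
                count = self.word_counts[topic][word]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            scores[topic] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy training data.
clf = NaiveBayesTopicClassifier()
clf.train("the team won the football match", "sports")
clf.train("the striker scored a late goal", "sports")
clf.train("the election results were announced", "politics")
clf.train("parliament passed the new budget law", "politics")

print(clf.classify("a goal in the football match"))
print(clf.classify("the budget law and the election"))
```

Working in log space avoids numeric underflow on long documents, and the smoothing term keeps a word that never appeared under a topic from driving that topic's probability to zero.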

Bootstrapping classification with keywords

It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.

This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% using a bootstrapped naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
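The bootstrapping idea can be sketched roughly as follows. This is a simplified hard-assignment (self-training) variant written for illustration: it labels documents by keyword match, trains naive Bayes on those labels, and then relabels all documents with the trained model until the labels stop changing. It is not the procedure from the paper, which uses EM with probabilistic labels plus shrinkage. The seed keywords, documents, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical seed keywords and unlabeled documents, for illustration only.
SEED_KEYWORDS = {
    "sports": {"football", "goal"},
    "politics": {"election", "parliament"},
}
UNLABELED_DOCS = [
    "the football team won the match",
    "the striker scored a goal in the match",
    "the coach praised the team and the striker",
    "parliament passed the budget law",
    "the election decided the new minister",
    "the minister defended the budget",
]


def keyword_label(doc):
    """Return the topic with the most keyword hits, or None on no hit or a tie."""
    words = set(doc.split())
    hits = sorted(
        ((len(words & kws), topic) for topic, kws in SEED_KEYWORDS.items()),
        reverse=True,
    )
    if hits[0][0] == 0 or (len(hits) > 1 and hits[0][0] == hits[1][0]):
        return None
    return hits[0][1]


def train(labeled):
    """Collect per-topic document and word counts from (doc, topic) pairs."""
    doc_counts, word_counts = Counter(), defaultdict(Counter)
    for doc, topic in labeled:
        doc_counts[topic] += 1
        word_counts[topic].update(doc.split())
    return doc_counts, word_counts


def classify(doc, model, alpha=1.0):
    """Pick the topic with the highest smoothed naive Bayes log score."""
    doc_counts, word_counts = model
    vocab = set().union(*word_counts.values())
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic, n_docs in doc_counts.items():
        total_words = sum(word_counts[topic].values())
        score = math.log(n_docs / total_docs)
        for word in doc.split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[topic][word] + alpha)
                    / (total_words + alpha * len(vocab))
                )
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


def bootstrap(docs, rounds=5):
    """Keyword labels -> train -> relabel everything, until labels are stable."""
    labels = [keyword_label(d) for d in docs]
    for _ in range(rounds):
        labeled = [(d, t) for d, t in zip(docs, labels) if t is not None]
        model = train(labeled)
        new_labels = [classify(d, model) for d in docs]
        if new_labels == labels:
            break
        labels = new_labels
    return model, labels


model, labels = bootstrap(UNLABELED_DOCS)
for doc, topic in zip(UNLABELED_DOCS, labels):
    print(f"{topic:9s} {doc}")
```

In this toy run, the documents about the coach and the minister contain no seed keyword, yet they still receive a topic because they share words such as "team" and "budget" with keyword-labeled documents. That is the sense in which the word-to-topic mapping grows beyond the hard-coded list.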
