Ways to improve the accuracy of a Naive Bayes Classifier?


Problem Description


I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.

I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this; some papers say AdaBoost with NB doesn't give better results, others say it does). Do you know of any other extensions to NB that might give better accuracy?

Solution

In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train--noticeably faster than any classifier-builder I have ever used).

So when you want to improve classifier prediction, you can look in several places:

  • tune your classifier (adjusting the classifier's tunable parameters);

  • apply some sort of classifier combination technique (e.g., ensembling, boosting, bagging); or you can

  • look at the data fed to the classifier--either add more data, improve your basic parsing, or refine the features you select from the data.

With respect to naive Bayes classifiers, parameter tuning is limited; I recommend focusing on your data--i.e., the quality of your pre-processing and the feature selection.
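(For completeness: about the only knob most NB implementations expose is the Laplace/Lidstone smoothing prior. A minimal sketch of searching it, assuming scikit-learn's MultinomialNB -- the poster's hand-rolled classifier may expose a different, or no, smoothing parameter:)

```python
# Minimal sketch, assuming scikit-learn; a hand-rolled NB may differ.
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

def tune_smoothing(X_train, y_train):
    """Grid-search the Laplace/Lidstone smoothing prior `alpha`."""
    search = GridSearchCV(
        MultinomialNB(),
        param_grid={"alpha": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0]},
        cv=5,                  # 5-fold cross-validation
        scoring="accuracy",
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```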

I. Data Parsing (pre-processing)

I assume your raw data is something like a string of raw text for each data point, and that a series of processing steps transforms each string into a structured vector (1D array) such that each offset corresponds to one feature (usually a word) and the value at that offset corresponds to its frequency.
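A minimal, dependency-free sketch of that string-to-vector step (the tokenizer and the tiny vocabulary are illustrative, not the poster's code); the refinements in the list below would slot into the tokenization stage:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split a raw string into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def to_term_vector(text, vocabulary):
    """Map a raw string to a term-frequency vector over a fixed vocabulary.

    Offset i of the returned vector holds the count of vocabulary[i]
    in the document, matching the layout described above.
    """
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Illustrative two-word vocabulary:
vocab = ["program", "spam"]
print(to_term_vector("Program the program, not spam.", vocab))  # -> [2, 1]
```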

  • stemming: either manually or by using a stemming library? The popular open-source ones are Porter, Lancaster, and Snowball. So for instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want (see the sketch after this list).

  • synonym finding: same idea as stemming--fold related words into a single word; so a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term

  • neutral words: words with similar frequencies across classes make poor features
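A sketch of the stemming and synonym-folding steps referred to above, assuming NLTK's PorterStemmer is available; the synonym map is a made-up example (a real one might be built from WordNet):

```python
from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()

# Hypothetical hand-built synonym map; in practice it might come from WordNet.
SYNONYMS = {"developer": "programmer", "coder": "programmer"}

def normalize(tokens):
    """Fold synonyms into a single term, then reduce each word to its stem."""
    folded = [SYNONYMS.get(t, t) for t in tokens]
    return [stemmer.stem(t) for t in folded]

print(normalize(["programs", "programming", "programmed", "developer"]))
# variants of 'program' collapse toward a shared stem (the exact form
# depends on the stemmer), so they count as one feature in the term vector
```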


II. Feature Selection

Consider a prototypical use case for NBCs: filtering spam. You can quickly see how it fails, and just as quickly see how to improve it. For instance, above-average spam filters use nuanced features such as the frequency of words in all caps, the frequency of words in the title, and the occurrence of an exclamation point in the title. In addition, the best features are often not single words but, e.g., pairs of words or larger word groups.
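A sketch of such hand-crafted features plus word-pair (bigram) features; the feature names and the particular cues chosen are illustrative only:

```python
import re

def extract_features(title, body):
    """Illustrative extractor: all-caps frequency, title cues, and bigrams."""
    words = [w for w in re.findall(r"\S+", body) if w.isalpha()]
    features = {
        # fraction of body words written entirely in capitals
        "all_caps_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
        # how many body words also appear in the title
        "title_word_overlap": sum(w.lower() in title.lower() for w in words),
        # does the title contain an exclamation point?
        "title_has_exclamation": "!" in title,
    }
    # word pairs (bigrams) as binary features
    lowered = [w.lower() for w in words]
    for a, b in zip(lowered, lowered[1:]):
        features[f"bigram:{a} {b}"] = True
    return features

print(extract_features("FREE MONEY!!!", "Click NOW to claim your free prize"))
```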

III. Specific Classifier Optimizations

Instead of 30 classes, use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else'), then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', and so on.
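A sketch of that cascade; `train_binary` is a placeholder for whatever trains the two-class NB, and the fit/predict interface is an assumption:

```python
def train_one_against_many(train_binary, documents, labels, class_order):
    """Train a cascade of binary classifiers: one class vs. 'all else' per stage.

    train_binary(docs, binary_labels) is assumed to return a fitted
    two-class classifier exposing .predict(docs); class_order fixes
    which class is peeled off at each stage.
    """
    stages = []
    docs, labs = list(documents), list(labels)
    for cls in class_order[:-1]:
        binary = [1 if lab == cls else 0 for lab in labs]
        stages.append((cls, train_binary(docs, binary)))
        # keep only the 'all else' documents for the next stage
        kept = [(d, lab) for d, lab in zip(docs, labs) if lab != cls]
        docs = [d for d, _ in kept]
        labs = [lab for _, lab in kept]
    return stages, class_order[-1]  # the final class is whatever remains

def predict_one_against_many(stages, last_class, document):
    """Walk the cascade; the first stage that claims the document wins."""
    for cls, clf in stages:
        if clf.predict([document])[0] == 1:
            return cls
    return last_class
```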

The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). To me, Fisher amounts to normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability; the Fisher Method instead calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
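A sketch of that combination step; the per-feature P(category | feature) values are assumed to come from the existing NB counts, and the chi-square survival function is computed directly so no extra libraries are needed:

```python
import math

def chi2_sf(chi, df):
    """Chi-square survival function, valid for even degrees of freedom."""
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisher_prob(feature_probs):
    """Combine per-feature P(category | feature) values Fisher-style.

    Multiplies the probabilities, takes -2*ln of the product, and asks how
    that score compares with what a random (uninformative) set of features
    would produce; values near 1 mean the features consistently favour the
    category, values near 0 mean they argue against it.
    """
    product = 1.0
    for p in feature_probs:
        product *= max(p, 1e-9)  # guard against log(0)
    score = -2.0 * math.log(product)
    return chi2_sf(score, 2 * len(feature_probs))

print(fisher_prob([0.9, 0.8, 0.95]))  # features strongly favour the category -> near 1
print(fisher_prob([0.1, 0.2, 0.05]))  # features argue against it -> near 0
```

One would then either threshold this score or assign the class with the highest value; the exact decision rule is a design choice.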
