Feature Selection and Reduction for Text Classification


Problem Description

I am currently working on a project, a simple sentiment analyzer, such that there will be 2 and 3 classes in separate cases. I am using a corpus that is very rich in unique words (around 200,000). I used the bag-of-words method for feature selection, and to reduce the number of unique features, terms are eliminated based on a threshold value of frequency of occurrence. The final feature set includes around 20,000 features, which is actually a 90% decrease, but not enough for the intended test-prediction accuracy. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernels), and also Python and Bash in general.
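For illustration, a minimal sketch of this kind of frequency-threshold elimination, assuming scikit-learn (an assumption; the actual pipeline above uses LibSVM/SVM-light, and the corpus and threshold here are placeholders):

```python
# Minimal sketch of document-frequency thresholding (assumed scikit-learn;
# the corpus and the threshold of 2 are illustrative placeholders).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["great movie, loved it", "terrible film", "loved the acting"]

# min_df drops any term that appears in fewer than 2 documents, one common
# way to apply an occurrence-frequency threshold to bag-of-words features.
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())  # only the terms that survived
```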

The highest accuracy observed so far is around 75%, and I need at least 90%. That is the case for binary classification; for multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to increase it: by optimizing the training parameters or by optimizing the feature selection?

I have read articles about feature selection in text classification, and what I found is that three different methods are used, which are in fact clearly correlated with one another. These methods are as follows:

  • Frequency approach of bag-of-words (BOW)
  • Information Gain (IG)
  • χ² Statistic (CHI)

The first method is already the one I use, but I use it very simply and need guidance on a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI, and am looking for any help to guide me in that direction.
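For reference, a hedged sketch of practical CHI and IG-style selection using scikit-learn (an assumed library here; mutual information is used as a close, practical stand-in for information gain, and the toy data is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

docs = ["good plot", "bad plot", "good acting", "bad acting"]  # placeholder corpus
labels = [1, 0, 1, 0]                                          # placeholder classes

X = CountVectorizer().fit_transform(docs)

# CHI: chi-square test between each term and the class labels.
X_chi = SelectKBest(chi2, k=2).fit_transform(X, labels)

# IG-style: mutual information as a practical stand-in for information gain.
mi = lambda X, y: mutual_info_classif(X, y, discrete_features=True)
X_ig = SelectKBest(mi, k=2).fit_transform(X, labels)
```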

Thanks a lot, and if you need any additional info for help, just let me know.

  • @larsmans: Frequency Threshold: I am looking for the occurrences of unique words in examples, so that if a word occurs in different examples frequently enough, it is included in the feature set as a unique feature.

@TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you brought forward, and others. I found the Proportional Difference (PD) method best for feature selection, where features are uni-grams and Term Presence (TP) is used for the weighting (I didn't understand why you tagged Term Frequency-Inverse Document Frequency (TF-IDF) as an indexing method; I rather consider it a feature-weighting approach). Pre-processing is also an important aspect of this task, as you mentioned. I used certain types of string elimination for refining the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid proof of the success of the model I used. This is what I have done so far. I am now working on clustering and reduction models, have tried LDA and LSI, and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora that have an objective nature, like news corpora. If you have any information and guidance on these issues, I would appreciate it. I especially need info to set up an interface (Python-oriented, open source) between feature-space dimension reduction methods (LDA, LSI, moVMF, etc.) and clustering methods (k-means, hierarchical, etc.).
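One possible starting point for such a bridge, sketched with scikit-learn (an assumption for illustration; not the setup actually used): LSI via truncated SVD feeding k-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

docs = ["economy news text", "sports news text", "political news story"]  # placeholders

# LSI (truncated SVD) reduces the term space to latent dimensions, then
# k-means clusters the documents in that reduced space.
lsi_kmeans = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),
    KMeans(n_clusters=2, n_init=10),
)
cluster_ids = lsi_kmeans.fit_predict(docs)
```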

Answer

This is probably a bit late to the table, but...

As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.

Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
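A minimal sketch of that kind of pre-processing (the stop word list here is a tiny illustrative stand-in for a real one):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "is", "it"}  # placeholder; use a full list

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # strip punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("The plot is great, and the acting... it works!"))
# -> ['plot', 'great', 'acting', 'works']
```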

Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
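For example, switching a scikit-learn vectorizer (assumed here for illustration) from unigrams to unigrams plus word pairs is a one-parameter change:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "very good indeed"]  # placeholder corpus

vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
# includes pairs like 'not good' and 'very good', which carry sentiment
# information that unigrams alone lose
```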

It's also worth noting that dimension reduction comes as either feature selection or feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis, as they currently appear, without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher orthogonal terms that (hopefully) contain more information and reduce the feature space.
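A small sketch of that contrast, again assuming scikit-learn and made-up data: selection keeps a subset of the original term columns unchanged, while extraction builds new combined dimensions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

docs = ["good plot good cast", "bad plot weak cast",
        "great cast fine plot", "poor plot dull cast"]   # placeholders
y = [1, 0, 1, 0]
X = CountVectorizer().fit_transform(docs)

# Feature selection: keeps 3 of the original, still-interpretable term columns.
X_selected = SelectKBest(chi2, k=3).fit_transform(X, y)

# Feature extraction: 2 new orthogonal dimensions, each a weighted
# combination of all the original terms.
X_extracted = TruncatedSVD(n_components=2).fit_transform(X)
```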

Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and they each contain a term once, then category one is indeed associated with the term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much higher relation to that term than category one. If term densities are not taken into account, this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to only retain terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
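The distinction is easy to compute; a toy illustration (the corpus and term are placeholders):

```python
docs = ["spam spam spam offer", "meeting notes spam", "offer offer offer offer"]
term = "offer"

# Document frequency: in what fraction of documents does the term occur?
doc_freq = sum(term in d.split() for d in docs) / len(docs)

# Term density: how heavily does the term occur within each document?
densities = [d.split().count(term) / len(d.split()) for d in docs]

print(doc_freq)   # 2/3: the term occurs in two of the three documents
print(densities)  # [0.25, 0.0, 1.0]: very different weights inside each one
```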

Also, how do you index the data: are you using the Vector Space Model with simple boolean indexing, or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario, a more complex measure will be beneficial, as it can relate term importance for each category to its importance throughout the entire dataset.
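The two schemes are a flag apart in scikit-learn (assumed here for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good good movie", "bad movie", "good story"]  # placeholder corpus

X_bool = CountVectorizer(binary=True).fit_transform(docs)  # 1 if a term is present
X_tfidf = TfidfVectorizer().fit_transform(docs)            # frequency scaled by rarity
```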

Personally, I would experiment with some of the above possibilities first, and then consider tweaking the feature selection/extraction with one (or a combination) of the more complex equations if you need an additional performance boost.

Additions

Based on the new information, it sounds as though you are on the right track, and 84%+ accuracy (F1 or BEP, precision- and recall-based measures for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all the information-rich features from the data already, or that a few are still being pruned.

Having said that, something that can be used as a predictor of how well aggressive dimension reduction may work for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them, as the case may be). A paper describing it can be found here:

Paper with Outlier Count information
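As a rough illustration of the idea only (not the paper's actual procedure): rank features by an information measure and inspect how sharply the scores decay toward the tail; a long informative tail suggests aggressive pruning will be lossy. The library and data are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

docs = ["good plot", "bad plot", "good cast", "bad cast"]  # placeholders
y = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)
scores = mutual_info_classif(X, y, discrete_features=True)
ranked = np.sort(scores)[::-1]   # scores in descending order
print(ranked)                    # inspect the decline toward the tail
```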

With regard to describing TF-IDF as an indexing method, you are correct that it is a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally because dimension reduction measures are determined on a per-category basis, whereas index weighting measures tend to be more document-orientated, giving superior vector representation.

With respect to LDA, LSI and moVMF, I'm afraid I have too little experience with them to provide any guidance. Unfortunately, I've also not worked with Turkish datasets or the Python language.
