Feature Selection and Reduction for Text Classification


Problem Description

I am currently working on a project, a simple sentiment analyzer, such that there will be 2 and 3 classes in separate cases. I am using a corpus that is very rich in unique words (around 200,000). I use the bag-of-words method for feature selection and, to reduce the number of unique features, terms are eliminated below a threshold frequency of occurrence. The final set of features includes around 20,000 features, which is actually a 90% decrease, but not enough for the intended test-prediction accuracy. I am using LibSVM and SVM-light in turn for training and prediction (with both linear and RBF kernels), along with Python and Bash in general.
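For concreteness, here is a minimal sketch of that kind of frequency-threshold bag-of-words reduction using scikit-learn; the toy documents and the cutoff of 2 documents are assumptions, not the asker's actual corpus or threshold.

```python
# Minimal sketch of frequency-threshold bag-of-words feature reduction.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the film was wonderful and moving",
    "the film was dull and far too long",
    "wonderful acting and a moving story",
]

# min_df drops every token that occurs in fewer than 2 documents, which
# mirrors "eliminate terms below an occurrence threshold".
vectorizer = CountVectorizer(min_df=2)
X = vectorizer.fit_transform(documents)        # document-term count matrix
print(vectorizer.get_feature_names_out())      # the surviving feature set
```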

The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% in both cases and cannot figure out how to achieve it: by optimizing training parameters or by optimizing feature selection?
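As a hedged sketch of the "optimizing training parameters" route, a cross-validated grid search over C (and gamma for the RBF kernel) is one common approach. It uses scikit-learn's SVC wrapper rather than the LibSVM/SVM-light command-line tools from the question, and the grid values and scoring metric are illustrative assumptions.

```python
# Cross-validated grid search over SVM hyper-parameters (illustrative values).
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = [
    {"kernel": ["linear"], "C": [0.01, 0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
]

search = GridSearchCV(SVC(), param_grid, scoring="f1_macro", cv=5)
# X is the (n_samples, n_features) training matrix, y the class labels:
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```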

I have read articles about feature selection in text classification and found that three different methods are used, which actually correlate clearly with one another. These methods are as follows:

  • Frequency approach of bag-of-words (BOW)
  • Information Gain (IG)
  • X^2 Statistic (CHI)

The first method is already the one I use, but I use it very simply and need guidance on how to use it better in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and am looking for any help to guide me in that direction.
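As a starting point for the practical side, scikit-learn ships ready-made scorers for CHI and for mutual information (which is closely related to the Information Gain criterion). A minimal sketch on assumed toy data:

```python
# Scoring terms with the X^2 statistic (CHI) and an IG-style measure.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

docs = [
    "great phone, loved it",
    "terrible phone, broke fast",
    "loved the battery",
    "terrible battery life",
]
labels = np.array([1, 0, 1, 0])

X = CountVectorizer(binary=True).fit_transform(docs)

chi_scores, _ = chi2(X, labels)                                      # X^2 per term
ig_scores = mutual_info_classif(X, labels, discrete_features=True)   # IG-like score

# Keep only the k best-scoring terms (k is an arbitrary choice here).
X_reduced = SelectKBest(chi2, k=3).fit_transform(X, labels)
```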

Thanks a lot, and if you need any additional info for help, just let me know.

  • @larsmans: Frequency Threshold: I am looking for the occurrences of unique words in examples, such that if a word is occurring in different examples frequently enough, it is included in the feature set as a unique feature.

@TheManWithNoName: First of all, thanks for your effort in explaining the general concerns of document classification. I examined and experimented with all the methods you brought forward, and others. I found the Proportional Difference (PD) method best for feature selection, where features are uni-grams and Term Presence (TP) is used for weighting (I didn't understand why you tagged Term Frequency-Inverse Document Frequency (TF-IDF) as an indexing method; I would rather consider it a feature-weighting approach). Pre-processing is also an important aspect of this task, as you mentioned. I used certain types of string elimination to refine the data, as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (F-measure) for binary classification and ~84% for multi-class. These values are solid proof of the success of the model I used. This is what I have done so far. Now I am working on clustering and reduction models; I have tried LDA and LSI and am moving on to moVMF and maybe spherical models (LDA + moVMF), which seem to work better on corpora with an objective nature, such as news corpora. I would appreciate any information and guidance on these issues. I especially need info to set up an interface (Python-oriented, open-source) between feature-space dimension-reduction methods (LDA, LSI, moVMF, etc.) and clustering methods (k-means, hierarchical, etc.).
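One open-source, Python-oriented way to chain a reduction step into a clustering step is scikit-learn, where an LSI-style TruncatedSVD (or LatentDirichletAllocation) step feeds directly into k-means. The sketch below covers only the LSI/LDA-style part (moVMF is not in scikit-learn), and all parameter values are illustrative assumptions.

```python
# LSI-style reduction (TruncatedSVD over TF-IDF) piped into k-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

lsi = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=100),   # LSI dimensionality, an arbitrary choice
    Normalizer(copy=False),           # unit-length vectors before k-means
)
# docs is a list of raw document strings:
# reduced = lsi.fit_transform(docs)
# clusters = KMeans(n_clusters=3).fit_predict(reduced)
```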

Answer

This is probably a bit late to the table, but...

As Bee points out and you are already aware, the use of an SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more than just a couple of stages, and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures, there are a number of much simpler possibilities that will typically require much lower resource consumption.

Do you pre-process the documents before performing tokenisation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
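A minimal sketch of that pre-processing step, assuming plain Python; the stop-word list here is a tiny illustrative subset, not a real list.

```python
# Lower-case, strip punctuation and drop stop words before tokenisation.
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "is", "was", "it"}

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    return [t for t in text.split() if t not in STOP_WORDS]

print(preprocess("The film, frankly, was a waste of time!"))
# ['film', 'frankly', 'waste', 'of', 'time']
```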

Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
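For instance, scikit-learn's CountVectorizer can emit word pairs directly via its ngram_range parameter; the (1, 2) setting below (unigrams plus bigrams) is just one possible choice.

```python
# Switching from single words to unigrams + bigrams.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "very good indeed"]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
# "not good" and "very good" now become separate features that a classifier
# can weight differently from the bare unigram "good".
print(vectorizer.get_feature_names_out())
```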

It's also worth noting that dimension reduction covers both feature selection and feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivariate, combining one or more single terms together to produce higher orthogonal terms that (hopefully) contain more information and reduce the feature space.
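A short sketch of that contrast, assuming a document-term matrix X and labels y already exist: univariate selection (chi-square here) keeps a subset of the original terms, while extraction (truncated SVD here) builds new combined dimensions.

```python
# Feature selection vs. feature extraction (both reduce dimensionality).
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import TruncatedSVD

# selected  = SelectKBest(chi2, k=5000).fit_transform(X, y)    # same terms, fewer of them
# extracted = TruncatedSVD(n_components=300).fit_transform(X)  # new composite dimensions
```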

Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term, or are you using the term densities found within the documents? If category one has only 10 documents and each contains a term once, then category one is indeed associated with that term. However, if category two has only 10 documents that each contain the same term a hundred times, then obviously category two has a much higher relation to that term than category one. If term densities are not taken into account, this information is lost, and the fewer categories you have, the more impact this loss will have. On a similar note, it is not always prudent to only retain terms that have high frequencies, as they may not actually be providing any useful information. For example, if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
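A rough plain-Python illustration of the distinction for a single category, on toy data: document frequency only asks how many documents contain the term, while term density also uses how often it occurs inside them.

```python
# Document frequency vs. term density for one category.
from collections import Counter

category_docs = [
    "goal goal goal penalty referee".split(),
    "match goal referee".split(),
]
term = "goal"

doc_freq = sum(1 for doc in category_docs if term in doc) / len(category_docs)
density = sum(Counter(doc)[term] for doc in category_docs) / sum(len(doc) for doc in category_docs)

print(doc_freq)   # 1.0 -> the term appears in every document of the category
print(density)    # 0.5 -> it also makes up half of all tokens in the category
```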

Also, how do you index the data? Are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario, a more complex measure will be beneficial, as it can account for term importance for each category in relation to its importance throughout the entire dataset.
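A small sketch of the two indexing choices on assumed toy documents: simple boolean presence/absence vectors versus TF-IDF weighted vectors.

```python
# Boolean indexing vs. TF-IDF weighted indexing.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["cheap cheap flights", "cheap hotel deals", "luxury hotel"]

boolean_X = CountVectorizer(binary=True).fit_transform(docs)  # 0/1 presence
tfidf_X = TfidfVectorizer().fit_transform(docs)               # weight scaled by rarity

print(boolean_X.toarray())
print(tfidf_X.toarray().round(2))
```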

Personally, I would experiment with some of the above possibilities first, and then consider tweaking the feature selection/extraction with a complex measure (or a combination of them) if you need an additional performance boost.

Addition

Based on the new information, it sounds as though you are on the right track, and 84%+ accuracy (F1 or BEP, precision- and recall-based measures for multi-class problems) is generally considered very good for most datasets. It might be that you have already successfully acquired all the information-rich features from the data, or that a few are still being pruned.

Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:

Paper with Outlier Count information

With regard to describing TF-IDF as an indexing method, you are correct that it is a feature-weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally because dimension-reduction measures are determined on a per-category basis, whereas index-weighting measures tend to be more document-oriented, giving a superior vector representation.

With respect to LDA, LSI and moVMF, I'm afraid I have too little experience with them to provide any guidance. Unfortunately, I have also not worked with Turkish datasets or the Python language.
