Classifying Documents into Categories


Problem description

I've got about 300k documents stored in a Postgres database that are tagged with topic categories (there are about 150 categories in total). I have another 150k documents that don't yet have categories. I'm trying to find the best way to programmatically categorize them.

I've been exploring NLTK and its Naive Bayes Classifier. It seems like a good starting point (if you can suggest a better classification algorithm for this task, I'm all ears).

My problem is that I don't have enough RAM to train the NaiveBayesClassifier on all 150 categories/300k documents at once (training on 5 categories used 8GB). Furthermore, the accuracy of the classifier seems to drop as I train on more categories (90% accuracy with 2 categories, 81% with 5, 61% with 10).

Should I just train a classifier on 5 categories at a time, and run all 150k documents through the classifier to see if there are matches? It seems like this would work, except that there would be a lot of false positives where documents that don't really match any of the categories get shoe-horned into one by the classifier just because it's the best match available... Is there a way to have a "none of the above" option for the classifier in case the document doesn't fit into any of the categories?

Here is my test class: http://gist.github.com/451880

Recommended answer

You should start by converting your documents into TF-log(1 + IDF) vectors: term frequencies are sparse, so you should use a Python dict with terms as keys and counts as values, then divide by the total count to get the global frequencies.
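Below is a minimal sketch, in Python, of building such dict-based TF-log(1 + IDF) vectors; the whitespace tokenizer and the exact IDF formula (n_docs / doc_freq) are assumptions for illustration, not part of the original answer.

import math
from collections import defaultdict

def tokenize(text):
    # naive whitespace tokenizer; swap in an NLTK tokenizer if you prefer
    return text.lower().split()

def term_frequencies(text):
    counts = defaultdict(int)
    for term in tokenize(text):
        counts[term] += 1
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def tf_log_idf_vectors(documents):
    tf_vectors = [term_frequencies(doc) for doc in documents]
    doc_freq = defaultdict(int)
    for tf in tf_vectors:
        for term in tf:
            doc_freq[term] += 1
    n_docs = len(documents)
    # weight each term frequency by log(1 + IDF)
    return [{term: tf * math.log(1 + n_docs / doc_freq[term]) for term, tf in vec.items()}
            for vec in tf_vectors]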

Another solution is to use abs(hash(term)), for instance, as positive integer keys. Then you can use scipy.sparse vectors, which are handier and more efficient for linear algebra operations than a Python dict.
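As a rough sketch of that hashing variant (the dimension 2**20 and the CSR layout are arbitrary choices, not from the answer):

import numpy as np
from scipy.sparse import csr_matrix

N_FEATURES = 2 ** 20  # fixed width; hash collisions simply add their weights together

def hashed_row(weighted_terms):
    # weighted_terms: a {term: weight} dict such as the TF-log(1 + IDF) dicts above
    cols = [abs(hash(term)) % N_FEATURES for term in weighted_terms]
    vals = list(weighted_terms.values())
    rows = np.zeros(len(cols), dtype=int)
    return csr_matrix((vals, (rows, cols)), shape=(1, N_FEATURES))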

Also build the 150 frequency vectors by averaging the frequencies of all the labeled documents belonging to the same category. Then, for a new document to label, you can compute the cosine similarity between the document vector and each category vector and choose the most similar category as the label for your document.
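One possible implementation of that centroid-plus-cosine step, reusing the hashed sparse rows from the previous snippet; the threshold argument is one simple way to get the "none of the above" behaviour asked about in the question, and names such as docs_by_category are placeholders:

import numpy as np
from scipy.sparse import vstack
from sklearn.metrics.pairwise import cosine_similarity

def category_centroids(docs_by_category):
    # docs_by_category: {category: [1 x N sparse rows of its labeled documents]}
    return {cat: np.asarray(vstack(rows).mean(axis=0))
            for cat, rows in docs_by_category.items()}

def classify(doc_vector, centroids, threshold=0.0):
    # return the most similar category, or None if nothing clears the threshold
    best_cat, best_sim = None, threshold
    for cat, centroid in centroids.items():
        sim = cosine_similarity(doc_vector, centroid)[0, 0]
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return best_cat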

If this is not good enough, then you should try to train a logistic regression model with an L1 penalty, as explained in this example from scikit-learn (this is a wrapper for liblinear, as explained by @ephes). The vectors used to train your logistic regression model should be the previously introduced TF-log(1 + IDF) vectors to get good performance (precision and recall). The scikit-learn lib offers a sklearn.metrics module with routines to compute those scores for a given model and a given dataset.
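A hedged sketch of that step with scikit-learn; TfidfVectorizer(sublinear_tf=True) is used here as a convenient stand-in for the hand-rolled TF-log(1 + IDF) vectors, and the hyperparameters are illustrative defaults rather than recommendations from the answer:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_and_evaluate(texts, labels):
    vectorizer = TfidfVectorizer(sublinear_tf=True)
    X = vectorizer.fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

    # liblinear supports the L1 penalty and works well on sparse text features
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X_train, y_train)

    # per-category precision/recall via sklearn.metrics
    print(classification_report(y_test, clf.predict(X_test)))
    return vectorizer, clf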

For larger datasets: you should try vowpal wabbit, which is probably the fastest rabbit on earth for large-scale document classification problems (but there is no easy-to-use Python wrapper, AFAIK).
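For what it's worth, a typical vowpal wabbit run for this kind of problem could look roughly like the following, assuming each of the 150 categories is mapped to an integer label 1..150 and each document is written as one line of term:weight features; the file names and options shown are placeholders, not a recipe from the original answer:

# train.vw, one document per line, e.g.:
#   42 | postgres:0.40 database:0.12 index:0.05
vw -d train.vw --oaa 150 --loss_function logistic -b 24 -f model.vw
# predict categories for the unlabeled documents
vw -d unlabeled.vw -t -i model.vw -p predictions.txt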
