How to select stop words using tf-idf? (non-English corpus)

Question

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

Recommended answer

Stop-words are words that appear very commonly across the documents and therefore lose their representativeness. The best way to observe this is to measure the number of documents each term appears in (its document frequency, df) and filter out the terms that appear in more than 50% of the documents, or the top 500 terms by df, or whatever threshold works for your corpus; you will have to tune it.
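
The filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents are assumed to be already tokenized and lowercased, the function names and the toy Spanish corpus are hypothetical, and the 50% and top-k thresholds are the ones mentioned in the answer, which you would tune for a real corpus.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def stop_words_by_ratio(docs, max_ratio=0.5):
    """Terms that appear in more than max_ratio of the documents."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return {term for term, count in df.items() if count / n_docs > max_ratio}


def stop_words_top_k(docs, k=500):
    """The k terms with the highest document frequency."""
    df = document_frequency(docs)
    return [term for term, _ in df.most_common(k)]


# Hypothetical toy corpus, already tokenized and lowercased.
docs = [
    ["el", "gato", "come", "pescado"],
    ["el", "perro", "come", "carne"],
    ["el", "gato", "duerme"],
    ["la", "casa", "es", "grande"],
]

print(stop_words_by_ratio(docs))    # {'el'}  ("el" is in 3 of 4 documents)
print(stop_words_top_k(docs, k=1))  # ['el']
```

Note that "gato" and "come" each appear in exactly 50% of the documents, so the strict "more than 50%" test leaves them out.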

The best (as in more representative) terms in a document are those with higher tf-idf because those terms are common in the document, while being rare in the collection.
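
As a rough sketch of picking the most representative terms of each document, the snippet below computes tf-idf by hand and keeps the highest-scoring terms. The weighting used here (relative term frequency times log(N/df), with no smoothing) is an assumption, since the answer does not say which tf-idf variant was used; the function names and the toy corpus are hypothetical as well.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf score} dict per document.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency of each term
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def best_terms(doc_scores, k=3):
    """The k terms with the highest tf-idf score in one document."""
    return sorted(doc_scores, key=doc_scores.get, reverse=True)[:k]


# Hypothetical toy corpus, already tokenized and lowercased.
corpus = [
    ["el", "gato", "gato", "come"],
    ["el", "perro", "come"],
    ["el", "casa"],
]

scores = tf_idf(corpus)
print(best_terms(scores[0], k=1))  # ['gato']
print(scores[0]["el"])             # 0.0, since "el" occurs in every document
```

With this weighting, a term that occurs in every document gets a score of exactly zero, which matches the intuition that it says nothing about any single document.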

As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop-words) produce very low tf-idf anyway. However, they still affect some computations, which is undesirable if you assume they are pure noise (which might not be true, depending on the task). In addition, if they are included your algorithm will be slightly slower.

Edit: As @FelipeHammel says, you can directly use the IDF (remember to invert the order) as a measure that is inversely related to df. This is completely equivalent for ranking purposes, and therefore for selecting the top "k" terms. However, IDF does not directly express a ratio (e.g., words that appear in more than 50% of the documents), although simple thresholding fixes that (i.e., selecting terms with IDF lower than a specific value). In general, a fixed number of terms is used.
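
To illustrate the relationship between IDF and document frequency, here is a small sketch assuming the plain, unsmoothed definition idf = log(N/df). Under that assumption, "appears in more than a fraction r of the documents" is the same condition as "idf < log(1/r)", so a 50% cut-off becomes idf < log(2). Smoothed IDF variants would need a different threshold. The function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf_table(docs):
    """Unsmoothed IDF, idf = log(N / df), for every term in the corpus."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def stop_words_by_idf(idf, max_ratio=0.5):
    """Terms whose IDF is below log(1 / max_ratio).

    With idf = log(N / df), this is the same as df / N > max_ratio.
    """
    threshold = math.log(1 / max_ratio)
    return {term for term, value in idf.items() if value < threshold}


def lowest_idf_terms(idf, k):
    """The k terms with the lowest IDF, i.e. the highest document frequency."""
    return sorted(idf, key=idf.get)[:k]


# Hypothetical toy corpus, already tokenized and lowercased.
texts = [
    ["el", "gato", "come", "pescado"],
    ["el", "perro", "come", "carne"],
    ["el", "gato", "duerme"],
    ["la", "casa", "es", "grande"],
]

idf = idf_table(texts)
print(stop_words_by_idf(idf))    # {'el'}
print(lowest_idf_terms(idf, 1))  # ['el']
```

Sorting by ascending IDF is the "inverted order" mentioned above: the lowest IDF values belong to the terms with the highest document frequency.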

I hope this helps.
