Interpreting the sum of TF-IDF scores of words across documents


Question


First let's extract the TF-IDF scores per term per document:

from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

Printing it out:

for doc in corpus_tfidf:
    print(doc)

[out]:

[(0, 0.4301019571350565), (1, 0.4301019571350565), (2, 0.4301019571350565), (3, 0.4301019571350565), (4, 0.2944198962221451), (5, 0.2944198962221451), (6, 0.2944198962221451)]
[(4, 0.3726494271826947), (7, 0.27219160459794917), (8, 0.3726494271826947), (9, 0.27219160459794917), (10, 0.3726494271826947), (11, 0.5443832091958983), (12, 0.3726494271826947)]
[(6, 0.438482464916089), (7, 0.32027755044706185), (9, 0.32027755044706185), (13, 0.6405551008941237), (14, 0.438482464916089)]
[(5, 0.3449874408519962), (7, 0.5039733231394895), (14, 0.3449874408519962), (15, 0.5039733231394895), (16, 0.5039733231394895)]
[(9, 0.21953536176370683), (10, 0.30055933182961736), (12, 0.30055933182961736), (17, 0.43907072352741366), (18, 0.43907072352741366), (19, 0.43907072352741366), (20, 0.43907072352741366)]
[(21, 0.48507125007266594), (22, 0.48507125007266594), (23, 0.48507125007266594), (24, 0.48507125007266594), (25, 0.24253562503633297)]
[(25, 0.31622776601683794), (26, 0.31622776601683794), (27, 0.6324555320336759), (28, 0.6324555320336759)]
[(25, 0.20466057569885868), (26, 0.20466057569885868), (29, 0.2801947048062438), (30, 0.40932115139771735), (31, 0.40932115139771735), (32, 0.40932115139771735), (33, 0.40932115139771735), (34, 0.40932115139771735)]
[(8, 0.6282580468670046), (26, 0.45889394536615247), (29, 0.6282580468670046)]

If we want to find the "saliency" or "importance" of the words within this corpus, can we simply sum the tf-idf scores across all documents and divide by the number of documents? I.e.

>>> from collections import Counter
>>> tfidf_saliency = Counter()
>>> for doc in corpus_tfidf:
...     for word, score in doc:
...         tfidf_saliency[word] += score / len(corpus_tfidf)
... 
>>> tfidf_saliency
Counter({7: 0.12182694202050007, 8: 0.11121194156107769, 26: 0.10886469856464989, 29: 0.10093919463036093, 9: 0.09022272408985754, 14: 0.08705221175200946, 25: 0.08482488519466996, 6: 0.08143359568202602, 10: 0.07480097322359022, 12: 0.07480097322359022, 4: 0.07411881371164887, 13: 0.07117278898823597, 5: 0.07104525967490458, 27: 0.07027283689263066, 28: 0.07027283689263066, 11: 0.060487023243988705, 15: 0.055997035904387725, 16: 0.055997035904387725, 21: 0.05389680556362955, 22: 0.05389680556362955, 23: 0.05389680556362955, 24: 0.05389680556362955, 17: 0.048785635947490406, 18: 0.048785635947490406, 19: 0.048785635947490406, 20: 0.048785635947490406, 0: 0.04778910634833961, 1: 0.04778910634833961, 2: 0.04778910634833961, 3: 0.04778910634833961, 30: 0.045480127933079706, 31: 0.045480127933079706, 32: 0.045480127933079706, 33: 0.045480127933079706, 34: 0.045480127933079706})

Looking at the output, could we assume that the most "prominent" word in the corpus is:

>>> dictionary[7]
u'system'
>>> dictionary[8]
u'survey'
>>> dictionary[26]
u'graph'

If so, what is the mathematical interpretation of the sum of TF-IDF scores of words across documents?
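One way to read the loop above: dividing each summed score by len(corpus_tfidf) makes the result the arithmetic mean of a term's TF-IDF over all documents, with documents that lack the term contributing zero. A minimal standard-library sketch, using a couple of (term_id, score) pairs hand-copied (and abbreviated) from the output above:

```python
from collections import Counter

# Two abbreviated (term_id, score) rows from the printed output above (a sketch).
corpus_tfidf = [
    [(0, 0.4301), (4, 0.2944)],
    [(4, 0.3726), (7, 0.2722)],
]

# Sum-then-divide, exactly as in the question ...
saliency = Counter()
for doc in corpus_tfidf:
    for term_id, score in doc:
        saliency[term_id] += score / len(corpus_tfidf)

# ... equals the mean over ALL documents, absent terms counted as 0.
mean_term_4 = (0.2944 + 0.3726) / 2
print(round(saliency[4], 4) == round(mean_term_4, 4))
```

So this ranking rewards terms that score reasonably well in many documents, rather than terms with one very high score in a single document.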

Solution

One interpretation of a term's TF-IDF in a corpus is the highest TF-IDF score that term attains in any document of the corpus.

Find the Top Words in corpus_tfidf.

    topWords = {}
    for doc in corpus_tfidf:
        for iWord, tf_idf in doc:
            if iWord not in topWords:
                topWords[iWord] = 0

            if tf_idf > topWords[iWord]:
                topWords[iWord] = tf_idf

    for i, item in enumerate(sorted(topWords.items(), key=lambda x: x[1], reverse=True), 1):
        print("%2s: %-13s %s" % (i, dictionary[item[0]], item[1]))
        if i == 6: break
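The same max-aggregation can be exercised on plain (term_id, score) pairs without gensim; a minimal sketch with scores hand-copied (and abbreviated) from the output in the question:

```python
# Max TF-IDF per term over all documents — the aggregation used above,
# on hand-written (term_id, score) pairs (a sketch; no gensim needed).
corpus_tfidf = [
    [(7, 0.2722), (11, 0.5444)],
    [(7, 0.5040), (13, 0.6406)],
]

topWords = {}
for doc in corpus_tfidf:
    for iWord, tf_idf in doc:
        if tf_idf > topWords.get(iWord, 0):
            topWords[iWord] = tf_idf

top = sorted(topWords.items(), key=lambda x: x[1], reverse=True)
print(top[0][0])  # term 13 has the single highest score
```

Unlike the mean in the question, the max keeps only each term's single best score, so a rare term that dominates one document can rise to the top.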

Output comparison chart:
NOTE: Couldn't use gensim to create a matching dictionary with corpus_tfidf,
so only word indices can be displayed.

Question tfidf_saliency   topWords(corpus_tfidf)  Other TF-IDF implementation  
---------------------------------------------------------------------------  
1: Word(7)   0.121        1: Word(13)    0.640    1: paths         0.376019  
2: Word(8)   0.111        2: Word(27)    0.632    2: intersection  0.376019  
3: Word(26)  0.108        3: Word(28)    0.632    3: survey        0.366204  
4: Word(29)  0.100        4: Word(8)     0.628    4: minors        0.366204  
5: Word(9)   0.090        5: Word(29)    0.628    5: binary        0.300815  
6: Word(14)  0.087        6: Word(11)    0.544    6: generation    0.300815  

The calculation of TF-IDF always takes the whole corpus into account.
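To make that concrete, gensim's default weighting can be reproduced by hand. This sketch assumes gensim's documented defaults (raw term counts, idf = log2(N/df), L2-normalised document vectors) and recomputes the first document's scores from the question:

```python
import math

# Reproduce gensim TfidfModel's default weights by hand (a sketch, assuming
# the defaults: raw counts, idf = log2(N / df), L2-normalised doc vectors).
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stoplist = set('for a of the and to in'.split())
texts = [[w for w in d.lower().split() if w not in stoplist] for d in documents]

# Document frequency: in how many documents each term occurs.
df = {}
for text in texts:
    for term in set(text):
        df[term] = df.get(term, 0) + 1

N = len(texts)

def tfidf_doc(text):
    # Raw count * idf per term, then L2-normalise the document vector.
    counts = {}
    for term in text:
        counts[term] = counts.get(term, 0) + 1
    weights = {t: c * math.log2(N / df[t]) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf_doc(texts[0])
print(round(first['human'], 4))  # 0.2944, matching gensim's output above
print(round(first['lab'], 4))    # 0.4301, matching gensim's output above
```

Changing any document in the corpus changes df, and therefore every idf and every normalised score — which is why the same term can rank differently across corpora and across TF-IDF implementations.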

Tested with Python 3.4.2
