TfidfVectorizer seems to be giving incorrect results


Problem description

  • I have a list of length 7 (7 subjects)
  • Each element of the list contains one very long string of words.
  • Each element of the list can be thought of as a subject, with a long sentence that sets it apart
  • I want to check which words make each subject (each element of the list) unique

Here is my code:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

train = read_train_file() # A list with huge sentences that I can't paste here

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
tfidf_wm        = tfidfvectorizer.fit_transform(train)
tfidf_tokens    = tfidfvectorizer.get_feature_names_out()

# train_df is the DataFrame the train list was built from
df_tfidfvect = pd.DataFrame(data=tfidf_wm.toarray(), index=train_df.discourse_type.unique(), columns=tfidf_tokens)

for col in df_tfidfvect.T.columns:
    print(f"\nsubject: {col}")
    print(df_tfidfvect.T[col].nlargest(2))


Output:

subject: Position
people    0.316126
school    0.211516
Name: Position, dtype: float64

subject: Claim
people    0.354722
school    0.296632
Name: Claim, dtype: float64

subject: Evidence
people    0.366234
school    0.282213
Name: Evidence, dtype: float64

subject: Concluding Statement
people    0.385200
help      0.267567
Name: Concluding Statement, dtype: float64

subject: Lead
people    0.399011
school    0.336605
Name: Lead, dtype: float64

subject: Counterclaim
people       0.361070
electoral    0.321909
Name: Counterclaim, dtype: float64

subject: Rebuttal
people    0.31029
school    0.26789
Name: Rebuttal, dtype: float64

As you can see, "people" and "school" get high TF-IDF scores in almost every subject.

I may be wrong, but I expected that the words characterizing a given subject would not be the same across all subjects (according to the TF-IDF formula).

Part of the train data:

for i, v in enumerate(train):
    print(f"subject: {i}: {train[i][:50]}")

subject: 0: like policy people average cant play sports b poin
subject: 1: also stupid idea sports suppose fun privilege play
subject: 2: failing fail class see act higher c person could g
subject: 3: unfair rule thought think new thing shaped land fo
subject: 4: land form found human thought many either fight de
subject: 5: want say know trying keep class also quite expensi
subject: 6: even less sense saying first find something really

So what is wrong with TfidfVectorizer?

Answer

According to the documentation of TfidfVectorizer (effectively of TfidfTransformer, which is used internally to transform the count matrix to a tf-idf representation), the idf formula:

is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t.

Note that the idf formula above differs from the standard textbook notation, which defines the idf as idf(t) = log [ n / (df(t) + 1) ].

If smooth_idf=True (the default), the constant "1" is added to the numerator and denominator of the idf, as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
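
To make these formulas concrete, here is a minimal sketch (my addition, on a toy corpus rather than the question's data) that recomputes the smoothed idf by hand and checks it against the idf_ attribute of a fitted TfidfVectorizer:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]  # toy corpus, n = 3

vec = TfidfVectorizer()  # smooth_idf=True by default
vec.fit(docs)

n = len(docs)
# document frequency of each term, in the order of vec.get_feature_names_out()
df_t = np.array([sum(term in doc.split() for doc in docs)
                 for term in vec.get_feature_names_out()])

manual_idf = np.log((1 + n) / (1 + df_t)) + 1  # idf(t) = ln[(1+n)/(1+df(t))] + 1
print(np.allclose(manual_idf, vec.idf_))       # True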

In short, sklearn's TfidfVectorizer uses a different formula from the standard one, which is usually either idf(t) = log [ n / df(t) ] or idf(t) = log [ n / (df(t) + 1) ] (the denominator is adjusted to prevent zero divisions, in case a term is not in the corpus). Moreover,

tf is "n" (natural) by default,

meaning that sklearn uses as tf the number of times a term occurs in a document (i.e., the raw count) rather than the relative frequency, i.e., (number of times term 't' occurs in a document) / (number of terms in a document). Further, sklearn uses cosine-similarity normalization:

normalization is "c" (cosine) when norm='l2'.
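
Putting the three pieces together (raw counts as tf, the smoothed idf above, and l2 normalization), sklearn's output can be reproduced end to end. A minimal sketch, again on a toy corpus of my own rather than the question's data:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
X = tfidf.fit_transform(docs).toarray()

# raw term counts: the "natural" tf (both vectorizers sort their vocabulary
# identically, so the columns line up)
counts = CountVectorizer().fit_transform(docs).toarray()
weighted = counts * tfidf.idf_  # tf * idf, element-wise per term
l2 = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # cosine (l2) normalization per row

print(np.allclose(l2, X))  # True: raw counts * idf, l2-normalized, matches sklearn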

For the reasons above, the results can differ from those of the standard TF-IDF formula. Additionally, when the corpus is very small, words that occur frequently across it will be given high TF-IDF scores, whereas the words given high scores should be those that occur frequently in one document but rarely in all the others. I am fairly sure that if you removed the stop-word filtering from TfidfVectorizer(stop_words='english'), you would even see stop words among the top-scoring terms, although TF-IDF is known to be used for removing stop words as well, since they are very common terms across a corpus and thus receive very low scores. (As an aside, stop words may be noise for one particular dataset or domain, yet highly informative features for another; whether to remove them should therefore be based on experimentation and analysis of the results. Also, if bigrams/trigrams are to be generated, eliminating stop words would allow them to match better.)

As noted above, this happens when the size of the corpus (the collection of documents) is rather small. In that case, as explained here, several words are more likely to appear in all (in your case, 7) documents of the corpus, and will thus all be penalized in the same way (their idf values will be identical). For instance, if the word "customer" occurred in your corpus as often as "people" (i.e., both appeared in the same number of documents), their idf values would be the same; however, frequently occurring words (such as stop words, if not eliminated, or "people" in your example) would, owing to their higher term frequency tf, end up with higher TF-IDF scores than words such as "customer", which might also appear in every document (as an example) but with a lower term frequency. To demonstrate this, see below, using sklearn's TfidfVectorizer (stop-word filtering was deliberately left out). The example data come from here. The function that returns the top-scoring words is based on this article (which I would suggest taking a look at).

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

df = pd.read_csv("Reviews.csv", usecols=['Text'])
train = df.Text[:7]

#tfidf = TfidfVectorizer(analyzer= 'word', stop_words= 'english')
tfidf = TfidfVectorizer(analyzer='word')

Xtr = tfidf.fit_transform(train)
features = tfidf.get_feature_names_out()

# Get top n tfidf values in row and return them with their corresponding feature names
def top_tfidf_feats(Xtr, features, row_id, top_n=10):
    row = np.squeeze(Xtr[row_id].toarray())  # convert the row into dense format first
    topn_ids = np.argsort(row)[::-1][:top_n] # indices that sort the row by tf-idf value, descending; keep the top_n
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(data=top_feats, columns=['feature', 'tfidf'])
    return df

top_feats_D1 = top_tfidf_feats(Xtr, features, 0)
print("Top features in D1\n", top_feats_D1, '\n')

top_feats_D2 = top_tfidf_feats(Xtr, features, 1)
print("Top features in D2\n", top_feats_D2, '\n')

top_feats_D3 = top_tfidf_feats(Xtr, features, 2)
print("Top features in D3\n", top_feats_D3, '\n')

Using three different sizes of the train (corpus) set (i.e., n=7, n=100, and n=1000), the results derived from the above are compared below with results derived using the standard TF-IDF formula. Here is the code for computing TF-IDF using the standard formula:

import math
from nltk.tokenize import word_tokenize

def tf(term, doc):
    terms = [t.lower() for t in word_tokenize(doc)]
    return terms.count(term) / len(terms)

def dft(term, corpus):
    return sum(1 for doc in corpus if term in [t.lower() for t in word_tokenize(doc)])

def idf(term, corpus):
    return math.log(len(corpus) / dft(term, corpus))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for i, doc in enumerate(train):
    if i == 3: # print results for the first 3 documents only
        break
    print("Top features in D{}".format(i + 1))
    scores = {term.lower(): tfidf(term.lower(), doc, train) for term in word_tokenize(doc) if term.isalpha()}
    sorted_terms = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    df_top_feats = pd.DataFrame()
    idx = 0
    for term, score in sorted_terms[:10]:
        df_top_feats.loc[idx, 'feature'] = term
        df_top_feats.loc[idx, 'tfidf'] = round(score, 5)
        idx += 1
    print(df_top_feats, '\n')

The results below speak for themselves. When only seven documents are used, several stop words are clearly among the top-scoring words (only the first three documents are shown below). As the number of documents grows, the overly common words (common across documents) are eliminated and others take their place. Interestingly, as shown below, the standard TF-IDF formula does a better job of eliminating frequently occurring terms, even when the corpus size is relatively small (i.e., n=7).
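
The arithmetic behind this difference is simple to check. With n = 7 documents, a word that appears in all seven gets a standard idf of log(7/7) = 0, so its tf-idf vanishes no matter how large its term frequency is; under sklearn's smoothed formula the same word gets idf = log(8/8) + 1 = 1, so its tf-idf stays proportional to its term frequency and frequent words remain on top:

import math

n, df_t = 7, 7  # a term that appears in every one of the 7 documents

standard_idf = math.log(n / df_t)                 # ln(7/7) = 0.0 -> tf-idf is 0
sklearn_idf = math.log((1 + n) / (1 + df_t)) + 1  # ln(8/8) + 1 = 1.0 -> tf-idf reduces to the (normalized) tf

print(standard_idf, sklearn_idf)  # 0.0 1.0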

Thus, you can work around the problem by implementing your own function (as above) to compute TF-IDF with the standard formula and seeing how it works for you, and/or by increasing the size of your corpus (in terms of documents). You could also try disabling smoothing and/or normalization, using TfidfVectorizer(smooth_idf=False, norm=None) (a minimal sketch follows below); however, the results might not be much different from the ones you currently get. Hope this helps.
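
For completeness, this is the variant mentioned above (assuming train is the list from the code earlier). With smoothing off, sklearn's idf becomes log(n/df(t)) + 1, so a term present in every document still gets idf = 1 rather than 0, which is why the results may not change much:

tfidf_raw = TfidfVectorizer(analyzer='word', smooth_idf=False, norm=None)
Xtr_raw = tfidf_raw.fit_transform(train)  # un-smoothed, un-normalized tf-idf matrix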

Results

            train = df.Text[:7]                                  train = df.Text[:100]                                   train = df.Text[:1000]
   Sklearn's Tf-Idf        Standard Tf-Idf             Sklearn's Tf-Idf           Standard Tf-Idf                Sklearn's Tf-Idf           Standard Tf-Idf

Top features in D1      Top features in D1          Top features in D1         Top features in D1            Top features in D1           Top features in D1
     feature     tfidf      feature    tfidf              feature     tfidf           feature   tfidf                feature     tfidf           feature    tfidf
0      than  0.301190   0      than  0.07631        0     better  0.275877     0     vitality  0.0903        0     vitality  0.263274     0     vitality  0.13545
1    better  0.301190   1    better  0.07631        1       than  0.243747     1       canned  0.0903        1  appreciates  0.263274     1     labrador  0.13545
2   product  0.250014   2      have  0.04913        2    product  0.229011     2        looks  0.0903        2     labrador  0.263274     2  appreciates  0.13545
3      have  0.250014   3   product  0.04913        3   vitality  0.211030     3         stew  0.0903        3         stew  0.248480     3         stew  0.12186
4       and  0.243790   4    bought  0.03816        4   labrador  0.211030     4    processed  0.0903        4      finicky  0.248480     4      finicky  0.12186
5        of  0.162527   5   several  0.03816        5       stew  0.211030     5         meat  0.0903        5       better  0.238212     5    processed  0.10826
6   quality  0.150595   6  vitality  0.03816        6      looks  0.211030     6       better  0.0903        6    processed  0.229842     6       canned  0.10031
7      meat  0.150595   7    canned  0.03816        7       meat  0.211030     7     labrador  0.0903        7       canned  0.217565     7       smells  0.10031
8  products  0.150595   8       dog  0.03816        8  processed  0.211030     8      finicky  0.0903        8       smells  0.217565     8         meat  0.09030
9    bought  0.150595   9      food  0.03816        9    finicky  0.211030     9  appreciates  0.0903        9         than  0.201924     9       better  0.08952
                                                                                                                                          
Top features in D2      Top features in D2          Top features in D2         Top features in D2            Top features in D2           Top features in D2
     feature     tfidf      feature    tfidf             feature     tfidf          feature    tfidf               feature     tfidf           feature    tfidf
0     jumbo  0.341277   0        as  0.10518        0     jumbo  0.411192      0      jumbo  0.24893         0      jumbo  0.491636       0      jumbo  0.37339
1   peanuts  0.341277   1     jumbo  0.10518        1   peanuts  0.377318      1    peanuts  0.21146         1    peanuts  0.389155       1    peanuts  0.26099
2        as  0.341277   2   peanuts  0.10518        2        if  0.232406      2    labeled  0.12446         2  represent  0.245818       2   intended  0.18670
3   product  0.283289   3   product  0.06772        3   product  0.223114      3     salted  0.12446         3   intended  0.245818       3  represent  0.18670
4       the  0.243169   4   arrived  0.05259        4        as  0.214753      4   unsalted  0.12446         4      error  0.232005       4    labeled  0.16796
5        if  0.210233   5   labeled  0.05259        5    salted  0.205596      5      error  0.12446         5    labeled  0.232005       5      error  0.16796
6  actually  0.170638   6    salted  0.05259        6  intended  0.205596      6     vendor  0.12446         6     vendor  0.208391       6     vendor  0.14320
7      sure  0.170638   7  actually  0.05259        7    vendor  0.205596      7   intended  0.12446         7   unsalted  0.198590       7   unsalted  0.13410
8     small  0.170638   8     small  0.05259        8   labeled  0.205596      8  represent  0.12446         8    product  0.186960       8     salted  0.12446
9     sized  0.170638   9     sized  0.05259        9  unsalted  0.205596      9    product  0.10628         9     salted  0.184777       9      sized  0.11954 
                                                                                                                                          
Top features in D3      Top features in D3          Top features in D3         Top features in D3            Top features in D3           Top features in D3
   feature     tfidf          feature    tfidf          feature     tfidf            feature    tfidf             feature     tfidf             feature    tfidf
0     and  0.325182     0        that  0.03570      0    witch  0.261635       0       witch  0.08450        0     witch  0.311210        0       witch  0.12675
1     the  0.286254     1        into  0.03570      1     tiny  0.240082       1        tiny  0.07178        1      tiny  0.224307        1        tiny  0.07832
2      is  0.270985     2        tiny  0.03570      2    treat  0.224790       2       treat  0.06434        2     treat  0.205872        2       treat  0.07089
3    with  0.250113     3       witch  0.03570      3     into  0.203237       3        into  0.05497        3      into  0.192997        3        into  0.06434
4    that  0.200873     4        with  0.03448      4      the  0.200679       4  confection  0.04225        4        is  0.165928        4  confection  0.06337
5    into  0.200873     5       treat  0.02299      5       is  0.195614       5   centuries  0.04225        5       and  0.156625        5   centuries  0.06337
6   witch  0.200873     6         and  0.01852      6      and  0.183265       6       light  0.04225        6      lion  0.155605        6     pillowy  0.06337
7    tiny  0.200873     7  confection  0.01785      7     with  0.161989       7     pillowy  0.04225        7    edmund  0.155605        7     gelatin  0.06337
8    this  0.168355     8         has  0.01785      8     this  0.154817       8      citrus  0.04225        8   seduces  0.155605        8    filberts  0.06337
9   treat  0.166742     9        been  0.01785      9  pillowy  0.130818       9     gelatin  0.04225        9  filberts  0.155605        9   liberally  0.06337 

    
