tfidf.transform() function not returning correct values

Question

I am trying to fit a TfidfVectorizer on a text corpus and then use the same fitted vectorizer to compute the sum of the tf-idf values of a new text. However, the sum is not what I expect. Here is an example:

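from sklearn.feature_extraction.text import TfidfVectorizer

text = ["I am new to python and R , how can anyone help me","why is no one able to crack the python code without help"]
tf = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
tf.fit_transform(text)
# Pair each vocabulary term with its learned IDF weight
# (older scikit-learn versions use get_feature_names() instead)
list(zip(tf.get_feature_names_out(), tf.idf_))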

[('able', 1.4054651081081644),
 ('code', 1.4054651081081644),
 ('crack', 1.4054651081081644),
 ('help', 1.0),
 ('new', 1.4054651081081644),
 ('python', 1.0)]

Now, when I apply the same fitted vectorizer to a new text:

import numpy as np

new_text = "i am not able to code"
np.sum(tf.transform([new_text]))
# Output: 1.4142135623730951

I am expecting the output to be around 2.80. Any suggestion on what might be going wrong here would be really helpful.

Recommended answer

This is because of L2 normalization (norm='l2', the default in TfidfVectorizer). As you expect, the result of transform() before normalization is:

array([[ 1.40546511,  1.40546511,  0.        ,  0.        ,  0.        ,
     0.        ]])

The vector is then normalized: every element is divided by the vector's Euclidean (L2) norm, the divisor:

divisor = sqrt(sqr(1.40546511) + sqr(1.40546511) + sqr(0) + sqr(0) + sqr(0) + sqr(0))
        = sqrt(1.975332175 + 1.975332175 + 0 + 0 + 0 + 0)
        = 1.98762782

So the final array is:

array([[ 0.70710678,  0.70710678,  0.        ,  0.        ,  0.        ,
     0.        ]])
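The normalization arithmetic can be checked by hand. The following is a minimal sketch in plain Python; the variable names (`raw`, `l2_norm`, `normalized`) are illustrative, and the input values are copied from the unnormalized array shown earlier in this answer rather than taken from scikit-learn itself.

```python
import math

# Unnormalized tf-idf vector for "i am not able to code"
# (values copied from the answer; feature order: able, code, crack, help, new, python)
raw = [1.40546511, 1.40546511, 0.0, 0.0, 0.0, 0.0]

# Euclidean (L2) norm: square root of the sum of squared components
l2_norm = math.sqrt(sum(v * v for v in raw))

# Divide every component by the norm to get a unit-length vector
normalized = [v / l2_norm for v in raw]

print(round(l2_norm, 6))                  # 1.987628
print([round(v, 8) for v in normalized])  # [0.70710678, 0.70710678, 0.0, 0.0, 0.0, 0.0]
```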

Summing this array gives 0.70710678 + 0.70710678 = 1.4142135623730951, which is the value you observed.
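If the unnormalized sum (around 2.81) is what is wanted, one option is to turn the normalization off. The sketch below assumes a scikit-learn installation and uses the `norm=None` setting of TfidfVectorizer; the names `tf_l2`, `tf_raw`, `sum_l2`, and `sum_raw` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = [
    "I am new to python and R , how can anyone help me",
    "why is no one able to crack the python code without help",
]
new_text = "i am not able to code"

# Default settings (norm='l2'): each output row is scaled to unit length
tf_l2 = TfidfVectorizer(stop_words="english", ngram_range=(1, 1))
tf_l2.fit(text)
sum_l2 = tf_l2.transform([new_text]).sum()

# norm=None skips the scaling step, so the raw tf * idf values are kept
tf_raw = TfidfVectorizer(stop_words="english", ngram_range=(1, 1), norm=None)
tf_raw.fit(text)
sum_raw = tf_raw.transform([new_text]).sum()

print(sum_l2)   # about 1.414
print(sum_raw)  # about 2.811
```

Whether to disable normalization depends on the use case: normalized rows are comparable across texts of different lengths, while raw sums grow with the number of matching terms.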

Hope it is clear now. You can refer to my answer here for the complete working of TfidfVectorizer.
