Is a CountVectorizer the same as TfidfVectorizer with use_idf=False?

Question

As the title states: is a CountVectorizer the same as a TfidfVectorizer with use_idf=False? If not, why not?

Does this also mean that adding the TfidfTransformer here is redundant?

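from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# corpus is the asker's list of tweet strings (not shown in the question)
vect = CountVectorizer(min_df=1)
tweets_vector = vect.fit_transform(corpus)
tf_transformer = TfidfTransformer(use_idf=False).fit(tweets_vector)
tweets_vector_tf = tf_transformer.transform(tweets_vector)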

Answer

No, they're not the same. By default (norm='l2'), TfidfVectorizer normalizes its results, i.e. each row vector in its output has Euclidean norm 1:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
>>> CountVectorizer().fit_transform(["foo bar baz", "foo bar quux"]).toarray()
array([[1, 1, 1, 0],
       [1, 0, 1, 1]])
>>> TfidfVectorizer(use_idf=False).fit_transform(["foo bar baz", "foo bar quux"]).toarray()
array([[ 0.57735027,  0.57735027,  0.57735027,  0.        ],
       [ 0.57735027,  0.        ,  0.57735027,  0.57735027]])

This is done so that dot products between rows are cosine similarities. Also, TfidfVectorizer can use logarithmically discounted frequencies (tf replaced by 1 + log(tf)) when given the option sublinear_tf=True.
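To see why unit-norm rows make dot products equal cosine similarities, here is a minimal sketch in plain Python. It does not call scikit-learn; the helper names l2_normalize and dot are made up for this illustration, and the count rows are copied from the CountVectorizer output above.

```python
import math

def l2_normalize(row):
    """Scale a row so its Euclidean norm is 1 (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

# Raw term counts for "foo bar baz" and "foo bar quux" (columns: bar, baz, foo, quux).
counts = [[1, 1, 1, 0], [1, 0, 1, 1]]

# What use_idf=False with the default L2 norm amounts to for these rows.
unit_rows = [l2_normalize(row) for row in counts]

# Cosine similarity computed directly from the raw counts.
cosine = dot(counts[0], counts[1]) / (
    math.sqrt(dot(counts[0], counts[0])) * math.sqrt(dot(counts[1], counts[1]))
)

# Dot product of the normalized rows: same number, no division needed.
dot_of_unit_rows = dot(unit_rows[0], unit_rows[1])

print(unit_rows[0])
print(cosine, dot_of_unit_rows)
```

Each nonzero entry of the normalized rows is 1/sqrt(3), which matches the 0.57735027 values in the TfidfVectorizer output.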

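To make TfidfVectorizer behave like CountVectorizer, give it the constructor options use_idf=False, norm=None (the parameter is called norm, not normalize). The values then match the raw counts, although they are returned as floats rather than integers.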
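A quick way to check this equivalence is to compare the two outputs on the toy documents from the answer. This is a sketch assuming a reasonably recent scikit-learn and NumPy; the variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo bar baz", "foo bar quux"]

# Plain term counts.
counts = CountVectorizer().fit_transform(docs).toarray()

# use_idf=False turns off IDF weighting; norm=None turns off row normalization.
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()

# Values are identical; only the dtype differs (integer counts vs. floats).
same_values = np.array_equal(counts, raw_tf)
print(same_values)
print(counts.dtype, raw_tf.dtype)
```

TfidfTransformer accepts the same norm parameter, so the transformer in the question is not a no-op with its default settings: with use_idf=False it still L2-normalizes each row unless norm=None is passed.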
