ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error

Question

I am running a very simple experiment with ColumnTransformer, intending to transform a list of columns, ["a"] in this example:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})
tfidf = TfidfVectorizer(min_df=0)
# Column "a" is selected with a list here, which is what triggers the error
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])], remainder="passthrough")
clmn.fit_transform(dataset)

This gives me:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Obviously, TfidfVectorizer can do fit_transform() on its own:

tfidf.fit_transform(dataset.a)
<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

What could be the reason for this error, and how can I correct it?

Recommended answer

That's because you are passing ["a"] instead of "a" to ColumnTransformer. According to the documentation:

A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

Now, TfidfVectorizer requires a single iterable of strings as input (that is, a 1-D array of strings). But because you are passing a list of column names to ColumnTransformer (even though that list contains only one column), a 2-D array is passed to TfidfVectorizer, hence the error.
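
As a rough sketch of why the 2-D input ends in an empty vocabulary, assuming default TfidfVectorizer settings: iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer effectively sees the single "document" "a", and the default token pattern drops one-character tokens. The variable names `docs_seen` and `error_message` are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# Selecting with a list gives a 2-D DataFrame; iterating over it yields
# the column labels, not the text stored in the rows.
docs_seen = list(dataset[["a"]])

# The vectorizer therefore tries to build a vocabulary from the label "a".
# The default token pattern keeps only tokens of two or more characters,
# so nothing is left.
try:
    TfidfVectorizer().fit_transform(dataset[["a"]])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(docs_seen)
print(error_message)
```

This is the same situation ColumnTransformer creates when the column is given as ["a"].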

Change it to:

clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")
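
For reference, a self-contained version of the corrected call might look like the following. It is a sketch that uses the default TfidfVectorizer settings rather than the min_df value from the question, and it inspects only the output shape, since whether ColumnTransformer returns a dense or sparse result depends on its sparse_threshold setting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# A scalar column name makes ColumnTransformer pass a 1-D Series of strings.
clmn = ColumnTransformer(
    [("tfidf", TfidfVectorizer(), "a")],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# Five distinct words from column "a" plus the passed-through column "c".
vocabulary = clmn.named_transformers_["tfidf"].vocabulary_
print(sorted(vocabulary))
print(result.shape)
```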

To understand this better, try both selectors on a pandas DataFrame and check the format (dtype, shape) of the returned data when you do:

dataset['a']

vs 

dataset[['a']]
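
A quick way to see the difference, using the dataset from the question (the names `as_series` and `as_frame` are illustrative):

```python
import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

as_series = dataset["a"]    # scalar key: 1-D pandas Series
as_frame = dataset[["a"]]   # list key: 2-D pandas DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The scalar key returns a one-dimensional Series of strings, which is what TfidfVectorizer expects. The list key returns a two-dimensional DataFrame with a single column.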

Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. If you want to apply tf-idf to two columns, you need to pass two transformers, something like this:

tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"), 
                          ("tfidf_2", tfidf_2, "b")
                         ],
                         remainder="passthrough")
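
A runnable sketch of the two-transformer setup follows. The dataset in the question has no column "b", so the one below is invented purely for illustration, and default TfidfVectorizer settings are assumed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: column "b" is made up for this example.
dataset = pd.DataFrame({
    "a": ["word gone wild", "gone with wind"],
    "b": ["quick brown fox", "lazy brown dog"],
    "c": [1, 2],
})

# One vectorizer per text column, each selected with a scalar name.
clmn = ColumnTransformer(
    [
        ("tfidf_1", TfidfVectorizer(), "a"),
        ("tfidf_2", TfidfVectorizer(), "b"),
    ],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# Each vectorizer learns its own vocabulary; the outputs are stacked
# side by side, followed by the passed-through column "c".
n_a = len(clmn.named_transformers_["tfidf_1"].vocabulary_)
n_b = len(clmn.named_transformers_["tfidf_2"].vocabulary_)
print(n_a, n_b, result.shape)
```

Each column gets its own vocabulary this way, so the same word appearing in both "a" and "b" produces two separate features.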
