ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error

Views: 43 | Published: 2021/7/16 19:51:04 | Tags: python, scikit-learn

Problem description

I am running a very simple experiment with ColumnTransformer, intending to transform an array of columns (["a"] in this example):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})
tfidf = TfidfVectorizer(min_df=0)
# The column selector is given as a list, ["a"]
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])], remainder="passthrough")
clmn.fit_transform(dataset)

This gives me:

ValueError: empty vocabulary; perhaps the documents only contain stop words

Yet TfidfVectorizer can run fit_transform() on its own without any problem:

tfidf.fit_transform(dataset.a)
<2x5 sparse matrix of type '<class 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse Row format>

What could be causing this error, and how can I fix it?

Recommended answer

That's because you are passing ["a"] instead of "a" to ColumnTransformer. According to the documentation:

A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.

TfidfVectorizer expects a single iterable of strings as input, that is, a 1-D array of strings. Because you pass a list of column names to ColumnTransformer (even though the list holds only one column), the transformer receives a 2-D array instead, hence the error.
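The following sketch, which is not part of the original answer, tries to show why a 2-D selection ends in "empty vocabulary" rather than some shape error. It assumes the default TfidfVectorizer settings, whose token pattern only keeps tokens of two or more characters. Iterating over a DataFrame yields its column labels, so the vectorizer appears to see a single document, the string "a", which contains no usable token. The names docs_seen and error_message are purely illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# Iterating over a DataFrame yields column labels, not the row contents.
docs_seen = list(dataset[["a"]])

# Feed the 2-D selection straight to the vectorizer and capture the failure.
try:
    TfidfVectorizer().fit_transform(dataset[["a"]])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(docs_seen)
print(error_message)
```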

Change it to:

clmn = ColumnTransformer([("tfidf", tfidf, "a")],
                         remainder="passthrough")
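For completeness, here is a self-contained version of the corrected call, using the DataFrame from the question. It is a minimal sketch: it uses the default TfidfVectorizer settings rather than min_df=0, and the variable names features and vocabulary are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# A scalar column name makes ColumnTransformer pass a 1-D column of strings.
clmn = ColumnTransformer(
    [("tfidf", TfidfVectorizer(), "a")],
    remainder="passthrough",
)
features = clmn.fit_transform(dataset)

# The fitted vectorizer is reachable by the name given in the transformer list.
vocabulary = clmn.named_transformers_["tfidf"].vocabulary_

print(features.shape)
print(sorted(vocabulary))
```

The result should have one column per vocabulary term plus one for the passthrough column "c".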

To understand this better, try selecting data from a pandas DataFrame both ways and check the format (dtype, shape) of what is returned:

dataset['a']

vs 

dataset[['a']]
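A quick way to run that comparison, sketched here with the DataFrame from the question (the names as_series and as_frame are illustrative):

```python
import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

as_series = dataset["a"]    # scalar key: a 1-D Series
as_frame = dataset[["a"]]   # list key: a 2-D DataFrame with a single column

print(type(as_series).__name__, as_series.ndim, as_series.shape)
print(type(as_frame).__name__, as_frame.ndim, as_frame.shape)
```

ColumnTransformer follows the same scalar-versus-list distinction when it selects columns for a transformer.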

Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. If you want to apply tfidf to two columns, you need to pass two transformers, something like this:

tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"), 
                          ("tfidf_2", tfidf_2, "b")
                         ],
                         remainder="passthrough")
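The DataFrame in the question has no column "b", so the sketch below invents one purely for illustration, with made-up text. It also uses the default vectorizer settings instead of min_df=0. Each text column gets its own vectorizer and therefore its own vocabulary, and the outputs are stacked side by side.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Column "b" is hypothetical; it is not part of the original question.
dataset = pd.DataFrame({
    "a": ["word gone wild", "gone with wind"],
    "b": ["red apple", "green apple"],
    "c": [1, 2],
})

clmn = ColumnTransformer(
    [
        ("tfidf_1", TfidfVectorizer(), "a"),
        ("tfidf_2", TfidfVectorizer(), "b"),
    ],
    remainder="passthrough",
)
features = clmn.fit_transform(dataset)

# Each vectorizer learns a separate vocabulary from its own column.
n_terms_a = len(clmn.named_transformers_["tfidf_1"].vocabulary_)
n_terms_b = len(clmn.named_transformers_["tfidf_2"].vocabulary_)

print(features.shape, n_terms_a, n_terms_b)
```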
