ColumnTransformer with TfidfVectorizer produces "empty vocabulary" error
Problem description
I am running a very simple experiment with ColumnTransformer, with the intent of transforming an array of columns, ["a"] in this example:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
dataset = pd.DataFrame({"a":["word gone wild","gone with wind"],"c":[1,2]})
tfidf = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf", tfidf, ["a"])],remainder="passthrough")
clmn.fit_transform(dataset)
This gives me:
ValueError: empty vocabulary; perhaps the documents only contain stop words
Obviously, TfidfVectorizer can do fit_transform() on its own:
tfidf.fit_transform(dataset.a)
<2x5 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
What could be the reason for this error, and how can I correct it?
Recommended answer
That's because you are providing ["a"] instead of "a" in ColumnTransformer. According to the documentation:
A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer.
Now, TfidfVectorizer requires a single iterable of strings as input (that is, a 1-d array of strings). But since you are passing a list of column names to ColumnTransformer (even though that list contains only a single column), a 2-d array is passed to TfidfVectorizer. Hence the error.
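To see why a 2-d input ends in an empty vocabulary, here is a minimal sketch, assuming the same dataset as in the question and a TfidfVectorizer with default settings. Iterating over a DataFrame yields its column labels rather than its rows, so the vectorizer effectively sees the single document "a", and the default token pattern only keeps tokens of two or more characters:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# Iterating a DataFrame yields column labels; iterating a Series yields values.
docs_from_2d = list(dataset[["a"]])
docs_from_1d = list(dataset["a"])

# The default analyzer drops single-character tokens, so "a" gives no terms.
analyzer = TfidfVectorizer().build_analyzer()
tokens_from_label = analyzer("a")

# Fitting on the 2-d selection therefore reproduces the error.
error_message = None
try:
    TfidfVectorizer().fit_transform(dataset[["a"]])
except ValueError as err:
    error_message = str(err)

print(docs_from_2d, docs_from_1d, tokens_from_label, error_message)
```

The names docs_from_2d, docs_from_1d, tokens_from_label and error_message are only illustrative.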
Change it to:
clmn = ColumnTransformer([("tfidf", tfidf, "a")],
remainder="passthrough")
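As a quick check, the corrected call can be run end to end. This is a sketch that assumes the dataset from the question and leaves min_df at its default; the result should have one tf-idf column per distinct word in "a" plus the passed-through column "c":

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

# Scalar column name "a": the vectorizer receives a 1-d Series of strings.
clmn = ColumnTransformer([("tfidf", TfidfVectorizer(), "a")],
                         remainder="passthrough")
result = clmn.fit_transform(dataset)

# Vocabulary learned by the fitted vectorizer inside the ColumnTransformer.
vocabulary = sorted(clmn.named_transformers_["tfidf"].vocabulary_)

# Five tf-idf columns (one per distinct word) plus the passed-through "c".
print(result.shape, vocabulary)
```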
To understand this better, try selecting data from a pandas DataFrame in both ways and check the format (dtype, shape) of what is returned when you do:
dataset['a']
vs
dataset[['a']]
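A small sketch of that comparison, assuming the dataset from the question (the variable names one_d and two_d are only illustrative):

```python
import pandas as pd

dataset = pd.DataFrame({"a": ["word gone wild", "gone with wind"], "c": [1, 2]})

one_d = dataset["a"]      # scalar key -> pandas Series (1-d)
two_d = dataset[["a"]]    # list key   -> pandas DataFrame (2-d)

print(type(one_d).__name__, one_d.shape)
print(type(two_d).__name__, two_d.shape)
```

ColumnTransformer makes the same distinction: a scalar column name selects 1-d data, while a list of names selects 2-d data.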
Update: @SergeyBushmanov, regarding your comment on the other answer, I think you are misinterpreting the documentation. If you want to apply tf-idf to two columns, you need to pass two transformers. Something like this:
tfidf_1 = TfidfVectorizer(min_df=0)
tfidf_2 = TfidfVectorizer(min_df=0)
clmn = ColumnTransformer([("tfidf_1", tfidf_1, "a"),
                          ("tfidf_2", tfidf_2, "b")],
                         remainder="passthrough")
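The dataset in the question has no column "b", so the following sketch adds a made-up one purely for illustration and leaves min_df at its default. Each text column gets its own vectorizer and its own vocabulary, and the outputs are stacked side by side with the passed-through column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Column "b" and its contents are hypothetical, added only for this example.
dataset = pd.DataFrame({
    "a": ["word gone wild", "gone with wind"],
    "b": ["quick brown fox", "lazy brown dog"],
    "c": [1, 2],
})

# One vectorizer per text column, each selected with a scalar column name.
clmn = ColumnTransformer(
    [("tfidf_1", TfidfVectorizer(), "a"),
     ("tfidf_2", TfidfVectorizer(), "b")],
    remainder="passthrough",
)
result = clmn.fit_transform(dataset)

# Each fitted vectorizer keeps a separate vocabulary.
vocab_a = clmn.named_transformers_["tfidf_1"].vocabulary_
vocab_b = clmn.named_transformers_["tfidf_2"].vocabulary_
print(result.shape, len(vocab_a), len(vocab_b))
```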