Adding words to scikit-learn's CountVectorizer's stop list
Question
Scikit-learn's CountVectorizer class lets you pass a string 'english' to the argument stop_words. I want to add some things to this predefined list. Can anyone tell me how to do this?
Recommended answer
According to the source code for sklearn.feature_extraction.text, the full list (actually a frozenset, from stop_words) of ENGLISH_STOP_WORDS is exposed through __all__. Therefore, if you want to use that list plus some more items, you could do something like:
from sklearn.feature_extraction import text
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)
(where my_additional_stop_words is any sequence of strings) and use the result as the stop_words argument. This input to CountVectorizer.__init__ is parsed by _check_stop_list, which will pass the new frozenset straight through.
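For completeness, here is a minimal end-to-end sketch of the approach. The extra stop words and the two sample documents are made up for illustration. One assumption worth flagging: newer scikit-learn releases validate the type of stop_words and may reject a frozenset, so the sketch converts the union to a list, which should work across versions.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical extra stop words, chosen only for this example.
my_additional_stop_words = ["scikit", "learn"]

# union() returns a new frozenset; the built-in list is left unchanged.
stop_words = ENGLISH_STOP_WORDS.union(my_additional_stop_words)

# Convert to a list as a precaution: some newer scikit-learn versions
# only accept 'english', a list, or None for stop_words.
vectorizer = CountVectorizer(stop_words=list(stop_words))

# Made-up sample documents.
docs = [
    "the scikit learn library counts words",
    "words are counted by the library",
]
X = vectorizer.fit_transform(docs)

# The learned vocabulary excludes both built-in and added stop words.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
```

If the union is reused in several places, building the list once and passing the same object around avoids repeating the conversion.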