Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?
Question
I have been working with the CountVectorizer class in scikit-learn.
I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.
These tokens are extracted from a set of keywords, i.e.
tags = [
"python, tools",
"linux, tools, ubuntu",
"distributed systems, linux, networking, tools",
]
The next step is:
from sklearn.feature_extraction.text import CountVectorizer
# tokenize is a custom function (not shown in the question) that splits each string into tokens
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)
which gives:
[[0 0 0 1 1 0]
[0 1 0 0 1 1]
[1 1 1 0 1 0]]
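For reference, here is a self-contained sketch of the setup above. The tokenize helper is not shown in the question, so the version below is an assumption: it splits each string on commas and strips whitespace, which is consistent with the six columns in the output.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Assumed helper (not from the original post): split on commas, strip spaces.
    return [token.strip() for token in text.split(",")]


tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()

# Column order follows the alphabetically sorted vocabulary.
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(feature_names)
print(data)
```

Under that assumption, the columns correspond to "distributed systems", "linux", "networking", "python", "tools" and "ubuntu", in that order.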
This is fine, but my situation is just a little bit different.
I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.
In other words, how can I get counts of another set of documents, say,
list_of_new_documents = [
["python, chicken"],
["linux, cow, ubuntu"],
["machine learning, bird, fish, pig"]
]
and get:
[[0 0 0 1 0 0]
[0 1 0 0 0 1]
[0 0 0 0 0 0]]
I have read the documentation for the CountVectorizer class, and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.
Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.
Recommended answer
You're right that vocabulary is what you want. It works like this:
>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 0],
[0, 0, 1]], dtype=int64)
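So you pass it your desired features, either as a list (as above, where the list order sets the column order) or as a dict with the features as keys and their column indices as values.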
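As a minimal sketch of the two forms, the snippet below builds one vectorizer from a list and one from a dict and checks that they count the same way. The variable names are illustrative only. Because the vocabulary is fixed up front, transform can be called directly without fitting first.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pease porridge hot", "pease porridge cold", "nine days old"]

# List form: the position in the list becomes the column index.
cv_list = CountVectorizer(vocabulary=["hot", "cold", "old"])

# Dict form: each term maps explicitly to its column index.
cv_dict = CountVectorizer(vocabulary={"hot": 0, "cold": 1, "old": 2})

counts_list = cv_list.transform(docs).toarray()
counts_dict = cv_dict.transform(docs).toarray()

print(counts_list)
print((counts_list == counts_dict).all())
```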
If you used CountVectorizer on one set of documents and then you want to use the set of features from those documents for a new set, use the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do
newVec = CountVectorizer(vocabulary=vec.vocabulary_)
to create a new vectorizer using the vocabulary from your first one.
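Applied to the example from the question, this might look like the sketch below. Two things here are assumptions rather than part of the original post: the tokenize helper (a comma splitter), and the new documents being written as plain strings instead of one-element lists, since CountVectorizer expects each document to be a string. The new vectorizer is also given the same tokenizer, because otherwise it would fall back to the default word-level tokenization and a multi-word feature such as "distributed systems" could never match.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Assumed helper (not from the original post): split on commas, strip spaces.
    return [token.strip() for token in text.split(",")]


tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# New documents as plain strings (the question wraps each one in a list).
new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# Reuse the learned vocabulary (and the same tokenizer) in a new vectorizer.
new_vec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)
new_counts = new_vec.transform(new_documents).toarray()

# Alternative: the already fitted vectorizer can transform unseen documents directly.
same_counts = vec.transform(new_documents).toarray()

print(new_counts)
```

Tokens that are not in the vocabulary, such as "chicken" or "cow", are simply ignored, so the third document ends up as a row of zeros. Calling transform on the original fitted vectorizer gives the same counts and avoids building a second object.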