Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?

Question

I have been working with the CountVectorizer class in scikit-learn.

I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer

# tokenize splits each comma-separated keyword string into tokens
# (its definition is not shown in the original; this is an assumed implementation)
def tokenize(text):
    return [token.strip() for token in text.split(",")]

vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)

which gives:

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

This is fine, but my situation is just a little bit different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts of another set of documents, say,

list_of_new_documents = [
  "python, chicken",
  "linux, cow, ubuntu",
  "machine learning, bird, fish, pig",
]

and get:

[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

I have read the documentation for the CountVectorizer class, and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.

Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.

Answer

You're right that vocabulary is what you want. It works like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass it your desired features, either as the keys of a dict mapping terms to column indices or simply as a list of terms.
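The mapping form fixes the column order explicitly. A minimal sketch; note that with a fixed vocabulary, no fit on a corpus is needed before calling transform:

```python
from sklearn.feature_extraction.text import CountVectorizer

# vocabulary as an explicit term -> column-index mapping
cv = CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})

# transform works directly because the vocabulary is already fixed
X = cv.transform(['pease porridge hot', 'nine days old']).toarray()
print(X)
```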

If you used CountVectorizer on one set of documents and then you want to use the set of features from those documents for a new set, use the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that reuses the vocabulary from your first one. If the first vectorizer used a custom tokenizer, pass the same tokenizer to the new one as well, so the new documents are split the same way.
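Applied to the question's data, a minimal end-to-end sketch (again assuming the comma-splitting tokenize, which the original does not define):

```python
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text):
    # assumed tokenizer: split the keyword string on commas
    return [token.strip() for token in text.split(",")]

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# learn the vocabulary from the original documents
vec = CountVectorizer(tokenizer=tokenize)
vec.fit(tags)

# reuse that vocabulary (and the same tokenizer) for new documents
new_vec = CountVectorizer(tokenizer=tokenize, vocabulary=vec.vocabulary_)

list_of_new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

counts = new_vec.transform(list_of_new_documents).toarray()
print(counts)
```

Tokens absent from the original vocabulary (chicken, cow, machine learning, ...) simply get no column, which yields the matrix the question asks for.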
