Can I use CountVectorizer in scikit-learn to count frequency of documents that were not used to extract the tokens?


Problem description


I have been working with the CountVectorizer class in scikit-learn.

I understand that if used in the manner shown below, the final output will consist of an array containing counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.

tags = [
  "python, tools",
  "linux, tools, ubuntu",
  "distributed systems, linux, networking, tools",
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer

# `tokenize` is the custom tokenizer function from the blog post credited below
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)

Where we get

[[0 0 0 1 1 0]
 [0 1 0 0 1 1]
 [1 1 1 0 1 0]]

This is fine, but my situation is just a little bit different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts of another set of documents, say,

list_of_new_documents = [
  ["python, chicken"],
  ["linux, cow, ubuntu"],
  ["machine learning, bird, fish, pig"]
]

And get:

[[0 0 0 1 0 0]
 [0 1 0 0 0 1]
 [0 0 0 0 0 0]]

I have read the documentation for the CountVectorizer class, and came across the vocabulary argument, which is a mapping of terms to feature indices. I can't seem to get this argument to help me, however.

Any advice is appreciated.
PS: all credit due to Matthias Friedrich's Blog for the example I used above.

Solution

You're right that vocabulary is what you want. It works like this:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 0],
       [0, 0, 1]], dtype=int64)

So you pass `vocabulary` either an iterable of the features you want (as above) or a dict mapping each term to its column index.
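For instance, a minimal sketch (assuming scikit-learn's default tokenizer) that passes an explicit term-to-index mapping, which pins the column order up front:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Explicit term -> column-index mapping fixes the column order in advance.
cv = CountVectorizer(vocabulary={'hot': 0, 'cold': 1, 'old': 2})

# No fit needed: the vocabulary is already fixed, so transform() works directly.
counts = cv.transform(['pease porridge hot', 'nine days old']).toarray()
print(counts)
# [[1 0 0]
#  [0 0 1]]
```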

If you used CountVectorizer on one set of documents and then want to count those same features in a new set, take the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that uses the vocabulary from your first one.
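Putting the two steps together for the question's example, a minimal end-to-end sketch. It uses scikit-learn's default tokenizer rather than the blog's custom `tokenize` function, so the columns come out in alphabetical order and the arrays differ slightly from those shown above; note also that each new document is a plain string, since CountVectorizer expects one string per document.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

# Learn the feature set from the original documents.
vec = CountVectorizer()
vec.fit(tags)

# Reuse that feature set for a second vectorizer.
new_vec = CountVectorizer(vocabulary=vec.vocabulary_)

list_of_new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

# Only terms seen in `tags` are counted; unseen terms are ignored.
counts = new_vec.transform(list_of_new_documents).toarray()
print(counts)
```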
