How can I use sklearn CountVectorizer with multiple strings?


Question

I have a list of strings (10,000s of them). Some of the strings consist of multiple words. I have another list which contains some sentences. I am trying to count the number of times each string in my list appears in each sentence.

At present I am using sklearn's feature extraction tools, because they work very quickly when there are 10,000s of strings to look up and 10,000s of sentences.

Below is a simplified version of my code.

import numpy as np
from sklearn import feature_extraction

sentences = ["hi brown cow", "red ants", "fierce fish"]

listOfStrings = ["brown cow", "ants", "fish"]

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfStrings)
taggedSentences = cv.fit_transform(sentences).toarray()

taggedSentencesCutDown = taggedSentences > 0
# Pairs of (sentence index, index of the matched string in listOfStrings)
taggedSentencesCutDown = np.column_stack(np.where(taggedSentencesCutDown))

At the moment, if you run this the output is the following:

In [2]: taggedSentencesCutDown
Out[2]: array([[1, 1], [2, 2]])

What I want is:

In [2]: taggedSentencesCutDown
Out[2]: array([[0, 0], [1, 1], [2, 2]])

My current use of CountVectorizer shows that it is not looking for multiple-word strings. Is there some other way to do this without resorting to long for loops? Efficiency and time are quite important for my application, as my lists are in the 10,000s.
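A quick way to see why the multi-word entries are never counted is to inspect the vectorizer's analyzer directly (a sketch, assuming scikit-learn's default settings):

```python
from sklearn.feature_extraction.text import CountVectorizer

# With the default ngram_range=(1, 1), the analyzer splits each sentence
# into single-word tokens, so a vocabulary entry like "brown cow" has no
# token it could ever match against.
analyze = CountVectorizer().build_analyzer()
print(analyze("hi brown cow"))  # ['hi', 'brown', 'cow']
```

Since the vocabulary lookup happens per token, "brown cow" is silently skipped while "ants" and "fish" match.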

Thanks

Answer

I managed to solve this by playing with the n-grams parameter in CountVectorizer.

If I find the largest number of words in any single string in my word list, I can set that as the upper limit of my n-gram range. In the example above it is "brown cow", with two words.

cv = feature_extraction.text.CountVectorizer(vocabulary=listOfStrings,
       ngram_range=(1, 2))
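Putting it together, the upper bound can be derived from the vocabulary itself rather than hard-coded (a sketch of the full fix; `maxWords` is a helper name introduced here, not from the original):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["hi brown cow", "red ants", "fierce fish"]
listOfStrings = ["brown cow", "ants", "fish"]

# Longest entry in the vocabulary, in words, becomes the n-gram upper bound,
# so multi-word strings like "brown cow" are generated as candidate tokens.
maxWords = max(len(s.split()) for s in listOfStrings)

cv = CountVectorizer(vocabulary=listOfStrings, ngram_range=(1, maxWords))
taggedSentences = cv.fit_transform(sentences).toarray()

# Pairs of (sentence index, index of the matched string in listOfStrings)
taggedSentencesCutDown = np.column_stack(np.where(taggedSentences > 0))
print(taggedSentencesCutDown)  # [[0 0] [1 1] [2 2]]
```

This keeps the vectorized lookup, so it stays fast even with 10,000s of strings and sentences.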

