Is there a way to convert nltk featuresets into a scipy.sparse array?
Problem description
I'm trying to use scikit-learn, which needs numpy/scipy arrays as input. The featureset generated in NLTK consists of unigram and bigram frequencies. I could convert it manually, but that would be a lot of effort, so I'm wondering whether there's a solution I've overlooked.
Recommended answer
Not that I know of, but note that scikit-learn can do n-gram frequency counting itself. Assuming word-level n-grams:
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) counts both unigrams and bigrams.
# (The old WordNGramAnalyzer class has been removed from
# scikit-learn; ngram_range is the current API.)
v = CountVectorizer(ngram_range=(1, 2))
X = v.fit_transform(files)
where files is a list of strings (or, if you pass input='file' to CountVectorizer, a list of file-like objects). After this, X is a scipy.sparse matrix of raw frequency counts.