Is there a way to convert nltk featuresets into a scipy.sparse array?

Question

I'm trying to use scikit-learn, which needs numpy/scipy arrays as input. The featuresets generated in nltk consist of unigram and bigram frequencies. I could do the conversion manually, but that would be a lot of effort, so I'm wondering whether there is a solution I've overlooked.

Recommended answer

Not that I know of, but note that scikit-learn can do n-gram frequency counting itself. Assuming word-level n-grams:

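from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) counts word-level unigrams and bigrams. In current
# scikit-learn releases it replaces the WordNGramAnalyzer(min_n=1, max_n=2)
# helper that early releases provided.
v = CountVectorizer(analyzer="word", ngram_range=(1, 2))
X = v.fit_transform(files)

where files is a list of strings. To pass file-like objects or file paths instead, construct the vectorizer with input="file" or input="filename". After this, X is a scipy.sparse matrix of raw frequency counts.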
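If the featuresets have already been built in nltk and recounting from raw text is not an option, scikit-learn's DictVectorizer may be worth a look: it turns a list of feature dicts into a scipy.sparse matrix. The following is a minimal sketch rather than part of the original answer. The featuresets list is made-up illustration data, and it assumes each featureset is a plain dict of n-gram counts (for nltk's labelled (dict, label) pairs, the dicts would need to be separated from the labels first).

```python
from scipy.sparse import issparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical nltk-style featuresets: one dict of n-gram counts per document.
featuresets = [
    {"the": 2, "cat": 1, "the cat": 1},
    {"the": 1, "dog": 1, "the dog": 1},
]

# DictVectorizer returns a scipy.sparse matrix by default (sparse=True).
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(featuresets)

# Each column corresponds to one feature name; names are sorted by default.
print(vectorizer.feature_names_)
print(issparse(X), X.shape)
```

Features missing from a dict are stored as implicit zeros, so the matrix stays sparse even with a large n-gram vocabulary. New documents should be converted with vectorizer.transform so that they share the same column layout.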
