NLTK: How to create a corpus from a csv file


Question

I have a csv file:

col1         col2      col3

some text    someID    some value
some text    someID    some value

In each row, col1 holds the text of an entire document. I would like to create a corpus from this csv. My aim is to use sklearn's TfidfVectorizer to compute document similarity and extract keywords. So consider:

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(<my corpus here>)

so that I can then use:

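query = 'here is some text from a new document'  # renamed from str to avoid shadowing the builtin
response = tfidf.transform([query])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+)
feature_names = tfidf.get_feature_names_out()
for col in response.nonzero()[1]:
    print(feature_names[col], ' - ', response[0, col])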

How do I create a corpus using nltk? What form or data structure should the corpus take so that it can be supplied to the transform function?

Recommended answer

Check out read_csv from the pandas library. Here is the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

You can install pandas by running pip install pandas at the command line. Loading the csv and selecting that column should then be as easy as:

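import pandas as pd

data = pd.read_csv(path_to_csv)  # path_to_csv: path to your csv file
docs = data['col1']  # one document string per row

tfs = tfidf.fit_transform(docs)  # tfidf is the TfidfVectorizer from the question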
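
On the question of what form the corpus should take: fit_transform accepts any iterable of strings, one string per document, so no NLTK corpus object is strictly needed. As a rough sketch of the same idea without pandas, the standard-library csv module can collect col1 into a plain list. The sample rows, the comma delimiter, and the helper name load_corpus are assumptions made for illustration; they are not part of the original question or answer.

```python
import csv
import io

# Hypothetical sample data standing in for the real csv file. The column
# names follow the question; the rows and comma delimiter are invented.
SAMPLE_CSV = """col1,col2,col3
the cat sat on the mat,id1,0.5
dogs chase cats in the park,id2,1.2
"""


def load_corpus(handle, text_column="col1"):
    """Return the text column of an open csv file as a list of strings."""
    reader = csv.DictReader(handle)
    # Skip rows whose text cell is missing or empty.
    return [row[text_column] for row in reader if row.get(text_column)]


# io.StringIO stands in for a real file opened with open(path, newline="").
docs = load_corpus(io.StringIO(SAMPLE_CSV))
print(docs)
```

The resulting list can be passed to tfidf.fit_transform(docs) in the same way as the pandas column above. For a file on disk, open it with newline="" as the csv module documentation recommends and pass the file object to the helper.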
