Creating a custom categorized corpus in NLTK and Python

Question

I'm having a problem involving regular expressions and CategorizedPlaintextCorpusReader in Python.

I want to create a custom categorized corpus and train a Naive Bayes classifier on it. My issue is the following: I want two categories, "pos" and "neg". The positive files are all in one directory, main_dir/pos/*.txt, and the negative ones are in a separate directory, main_dir/neg/*.txt.

How can I use CategorizedPlaintextCorpusReader to load and label all the positive files in the pos directory, and do the same for the negative ones?

NB: The setup is exactly the same as that of the movie_reviews corpus (~nltk_data\corpora\movie_reviews).

Recommended answer

Here is the answer to my own question. I was considering two layouts, so I'll cover both in case someone needs them later. If your setup matches the movie_reviews corpus, that is, several folders, each named after the label you want and each containing that label's training files, you can use this:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'(\w+)/.*')
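For anyone who wants to try the folder-per-label layout end to end, here is a minimal sketch. It builds a throwaway corpus in a temporary directory, so the file names and sample sentences are made up for illustration, and `build_demo_corpus` and `load_corpus` are hypothetical helper names, not part of NLTK. It also expands `~` explicitly with `os.path.expanduser`, on the assumption that NLTK does not do this itself; the call is harmless if it does.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_demo_corpus(root):
    """Create root/pos/*.txt and root/neg/*.txt with toy contents (made-up data)."""
    samples = {
        "pos": ["A great film.", "Really enjoyable."],
        "neg": ["A dull film.", "Really tedious."],
    }
    for category, texts in samples.items():
        os.makedirs(os.path.join(root, category), exist_ok=True)
        for i, text in enumerate(texts):
            path = os.path.join(root, category, f"{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)


def load_corpus(root):
    """Load every .txt file under root, using the top-level folder name as the category."""
    # Expand a leading '~' so a root such as '~/MainFolder/' becomes an absolute path.
    root = os.path.expanduser(root)
    return CategorizedPlaintextCorpusReader(
        root,
        r".*\.txt",               # which files belong to the corpus
        cat_pattern=r"(\w+)/.*",  # first capture group = folder name = category
    )


demo_root = tempfile.mkdtemp()
build_demo_corpus(demo_root)
reader = load_corpus(demo_root)

print(reader.categories())
print(reader.fileids(categories="pos"))
print(reader.words(categories="neg")[:3])
```

With a real corpus, you would skip `build_demo_corpus` and pass your own directory to `load_corpus`.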

The other approach I was considering is to put everything in a single folder and name the files 0_neg.txt, 0_pos.txt, 1_neg.txt, etc. The code for your reader should then look something like:

reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt')
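To check a `cat_pattern` before pointing it at real files, it helps to know how it is used. As far as I understand the NLTK implementation, the pattern is applied to each file ID with `re.match`, and the first capture group becomes the category. The sketch below mimics that with the standard library only; `category_for` is a hypothetical helper written for this illustration, not an NLTK function, and the file IDs are made up.

```python
import re

# Pattern for the folder-per-label layout (e.g. pos/cv000.txt).
FOLDER_PATTERN = r"(\w+)/.*"
# Pattern for the single-folder layout (e.g. 0_neg.txt).
FILENAME_PATTERN = r"\d+_(\w+)\.txt"


def category_for(file_id, cat_pattern):
    """Return the first capture group of cat_pattern matched at the start of file_id, or None."""
    match = re.match(cat_pattern, file_id)
    return match.group(1) if match else None


for file_id in ["0_neg.txt", "0_pos.txt", "1_neg.txt"]:
    print(file_id, "->", category_for(file_id, FILENAME_PATTERN))

for file_id in ["pos/cv000.txt", "neg/cv001.txt"]:
    print(file_id, "->", category_for(file_id, FOLDER_PATTERN))
```

A file ID for which the helper returns None is one the pattern cannot categorize, so it is worth renaming such files or adjusting the pattern before building the reader.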

I hope this helps someone in the future.
