Creating a custom categorized corpus in NLTK and Python
Question
I'm having a problem that involves regular expressions and CategorizedPlaintextCorpusReader in Python.
I want to create a custom categorized corpus and train a Naive Bayes classifier on it. My issue is the following: I want two categories, "pos" and "neg". The positive files are all in one directory, main_dir/pos/*.txt, and the negative ones are in a separate directory, main_dir/neg/*.txt.
How can I use CategorizedPlaintextCorpusReader to load and label all the positive files in the pos directory, and do the same for the negative ones?
NB: The setup is exactly the same as the movie_reviews corpus (~nltk_data\corpora\movie_reviews).
Recommended answer
Here is the answer to my own question. I was considering two layouts, so I'll cover both in case someone needs them later. If your setup matches the movie_reviews corpus, that is, several folders, each named after the label you want and each containing that label's training files, you can use this:
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'(\w+)/*')
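To see what that cat_pattern does, here is a minimal sketch using only the standard re module. It assumes, as far as I understand the reader, that the pattern is applied with re.match to each file ID (the path relative to the corpus root, with forward slashes) and that the first capture group becomes the category. The helper category_of and the file names are made up for illustration and are not part of NLTK.

```python
import re

# Pattern from the answer: capture the leading folder name as the category.
CAT_PATTERN = r'(\w+)/*'

def category_of(file_id, pattern=CAT_PATTERN):
    """Return the category the pattern would assign to a file ID, or None.

    Hypothetical helper: it mimics the assumed behavior (re.match on the
    file ID, first capture group is the category).
    """
    match = re.match(pattern, file_id)
    return match.group(1) if match else None

# Hypothetical file IDs, relative to the corpus root.
file_ids = ["pos/review_001.txt", "pos/review_002.txt", "neg/review_001.txt"]
categories = {fid: category_of(fid) for fid in file_ids}
print(categories)
```

Note that (\w+) also matches a file sitting directly in the root folder: category_of("readme.txt") gives "readme". A stricter pattern such as r'(neg|pos)/.*' avoids that. Also, the root path above starts with ~, and I am not sure the reader expands it, so passing os.path.expanduser('~/MainFolder/') is the safer choice.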
The other approach I considered is to put everything in a single folder and name the files 0_neg.txt, 0_pos.txt, 1_neg.txt, and so on. The reader code should then look something like:
reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt')
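The same kind of check works for the single-folder layout. This is a rough sketch under the same assumption as before (the pattern is matched against each file ID and group 1 is the category). The helper group_by_category is hypothetical, and it raises on a file name that does not fit the naming scheme, since such a file would have no category to fall into.

```python
import re
from collections import defaultdict

# Pattern from the answer: "<number>_<label>.txt", label is the category.
CAT_PATTERN = r'\d+_(\w+)\.txt'

def group_by_category(file_ids, pattern=CAT_PATTERN):
    """Group file IDs by the category captured in group 1 of the pattern."""
    groups = defaultdict(list)
    for fid in file_ids:
        match = re.match(pattern, fid)
        if match is None:
            # No category can be derived from this name.
            raise ValueError(f"file ID does not match the pattern: {fid!r}")
        groups[match.group(1)].append(fid)
    return dict(groups)

# File names taken from the naming scheme described above.
file_ids = ["0_neg.txt", "0_pos.txt", "1_neg.txt"]
grouped = group_by_category(file_ids)
print(grouped)
```

Running a check like this over os.listdir() of the folder before constructing the reader is a cheap way to catch misnamed files early.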
I hope this helps someone in the future.