Creating a custom categorized corpus in NLTK and Python


Problem description

I'm running into a problem involving regular expressions and CategorizedPlaintextCorpusReader in Python.

I want to create a custom categorized corpus and train a Naive Bayes classifier on it. I need two categories, "pos" and "neg". The positive files are all in one directory, main_dir/pos/*.txt, and the negative ones are in a separate directory, main_dir/neg/*.txt.

How can I use CategorizedPlaintextCorpusReader to load and label all the positive files in the pos directory, and do the same for the negative ones?

NB: The setup is exactly the same as the movie_reviews corpus (~\nltk_data\corpora\movie_reviews).

Recommended answer

Here is the answer to my own question. I was considering two layouts, so I'll cover both in case someone needs them later. If you have the same setup as the movie_reviews corpus, that is, several folders named after the labels you want, each containing the training data for that label, you can use this:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'(\w+)/*')
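To see why a folder-based pattern yields the folder name as the label, it can help to try the regular expression on a few file ids by hand. The sketch below uses only the standard `re` module. It assumes that NLTK matches `cat_pattern` against each file id (the path relative to the corpus root, with forward slashes) from the start of the string and takes the first capture group as the category. The file names and the helper `category_of` are made up for illustration.

```python
import re

# Folder-based cat_pattern: capture the leading folder name.
CAT_PATTERN = r'(\w+)/*'

# Hypothetical file ids, relative to the corpus root.
file_ids = ['pos/review_001.txt', 'neg/review_001.txt', 'neg/review_002.txt']


def category_of(file_id, pattern=CAT_PATTERN):
    """Return the first captured group, or None if the pattern does not match."""
    match = re.match(pattern, file_id)
    return match.group(1) if match else None


labels = {file_id: category_of(file_id) for file_id in file_ids}
for file_id, label in labels.items():
    print(file_id, '->', label)
```

Because `/*` also matches zero slashes, a file sitting directly in the root folder would be labeled with the leading word characters of its own name. If that is a concern, a stricter pattern such as `(pos|neg)/.*` is one option.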

The other approach I was considering is to put everything in a single folder and name the files 0_neg.txt, 0_pos.txt, 1_neg.txt, etc. The reader code should then look something like:

reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt')
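As a rough end-to-end check of the single-folder layout, the following sketch writes a few toy files to a temporary directory, loads them with a filename-based `cat_pattern`, and trains a Naive Bayes classifier on simple word-presence features. The documents, the `features` helper, and the choice of features are assumptions made for illustration, not part of the original answer, and the example requires NLTK to be installed.

```python
import os
import tempfile

from nltk.classify import NaiveBayesClassifier
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical toy documents, named <number>_<label>.txt.
docs = {
    '0_pos.txt': 'a great and moving film',
    '1_pos.txt': 'great acting and a great story',
    '0_neg.txt': 'a dull and boring film',
    '1_neg.txt': 'boring story and dull acting',
}

# Write the toy corpus into a fresh temporary folder.
root = tempfile.mkdtemp()
for name, text in docs.items():
    with open(os.path.join(root, name), 'w', encoding='utf8') as handle:
        handle.write(text)

# The label is the part of the file name between the underscore and ".txt".
reader = CategorizedPlaintextCorpusReader(
    root, r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt')


def features(words):
    """Map a sequence of tokens to word-presence features."""
    return {word.lower(): True for word in words}


# One (features, label) pair per file.
train_set = [
    (features(reader.words(fileids=[file_id])), category)
    for category in reader.categories()
    for file_id in reader.fileids(categories=category)
]
classifier = NaiveBayesClassifier.train(train_set)

print(reader.categories())
print(classifier.classify(features(['great', 'story'])))
```

The same training loop should work unchanged with the folder-based layout, since only `cat_pattern` differs. When the corpus lives under the home directory, the root may need to be passed through `os.path.expanduser` first, since NLTK may not expand `~` on its own.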

I hope this helps someone in the future.
