Creating a custom categorized corpus in NLTK and Python

Views: 200 Published: 2020/5/18 0:29:17 Tags: python regex nlp nltk


Question

I'm having a problem that involves regular expressions and CategorizedPlaintextCorpusReader in Python.

I want to create a custom categorized corpus and train a Naive Bayes classifier on it. I want two categories, "pos" and "neg". The positive files are all in one directory, main_dir/pos/*.txt, and the negative files are in a separate directory, main_dir/neg/*.txt.

How can I use CategorizedPlaintextCorpusReader to load and label all the positive files in the pos directory, and do the same for the negative files in the neg directory?

NB: The setup is exactly the same as that of the movie_reviews corpus (~nltk_data\corpora\movie_reviews).

Recommended answer

Here is the answer to my own question. I was considering two layouts, so I'll cover both in case someone needs them later. If your setup matches the movie_reviews corpus, that is, several folders, each named after the label you want and each containing the training files for that label, you can use this:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'(\w+)/*')
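To see what the folder-based pattern does, here is a minimal sketch that builds a throwaway pos/neg layout in a temporary directory and reads it back. The sample file names and texts, the `build_demo_corpus` helper, and the `demo_root` variable are all made up for illustration. As far as I can tell, NLTK does not expand `~` in the root path, so for a real corpus it is probably safest to pass an absolute path or wrap the path in `os.path.expanduser`.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def build_demo_corpus(root):
    """Write a tiny pos/neg folder layout under root (hypothetical sample data)."""
    samples = {
        "pos/a.txt": "A great and enjoyable film.",
        "pos/b.txt": "Wonderful acting.",
        "neg/a.txt": "A dull and boring film.",
    }
    for rel_path, text in samples.items():
        path = os.path.join(root, rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)


# Temporary directory standing in for main_dir / MainFolder.
demo_root = tempfile.mkdtemp()
build_demo_corpus(demo_root)

# Same fileid and category patterns as in the answer: the leading
# folder name captured by (\w+) becomes the category of each file.
reader = CategorizedPlaintextCorpusReader(
    demo_root, r'.*\.txt', cat_pattern=r'(\w+)/*'
)

folder_categories = reader.categories()          # category names found
pos_fileids = reader.fileids(categories='pos')   # files labeled "pos"
neg_words = list(reader.words(categories='neg')) # tokens from "neg" files
```

The category pattern is matched against each file ID relative to the root (for example `pos/a.txt`), which is why the folder name ends up as the label.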

The other approach I was considering is to put everything in a single folder and name the files 0_neg.txt, 0_pos.txt, 1_neg.txt, etc. The code for your reader should then look something like this:

reader = CategorizedPlaintextCorpusReader('~/MainFolder/', r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt')
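The single-folder variant can be checked the same way. This is only a sketch under assumptions: the temporary directory, the sample texts, and the names `flat_root` and `flat_reader` are hypothetical, and the file names follow the 0_neg.txt / 0_pos.txt scheme described above.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Temporary flat directory standing in for MainFolder (hypothetical data).
flat_root = tempfile.mkdtemp()
flat_samples = {
    "0_neg.txt": "A dull and boring film.",
    "0_pos.txt": "A great and enjoyable film.",
    "1_neg.txt": "Terrible pacing.",
}
for name, text in flat_samples.items():
    with open(os.path.join(flat_root, name), "w", encoding="utf-8") as handle:
        handle.write(text)

# The category is the word between the numeric prefix and ".txt".
flat_reader = CategorizedPlaintextCorpusReader(
    flat_root, r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt'
)

flat_categories = flat_reader.categories()                 # all labels
neg_files = flat_reader.fileids(categories='neg')          # files labeled "neg"
label_of_file = flat_reader.categories(fileids='0_pos.txt')  # labels of one file
```

Because the label is taken from the file name rather than the folder, this layout keeps all documents in one place, at the cost of having to follow the naming scheme strictly.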

I hope this helps someone in the future.
