Need to set up a categorized corpus reader in NLTK and Python: corpus texts in one file, one text per line

Published: 2020/5/18 1:20:09 · Tags: python-2.7, text, nltk, corpus, categorization


Problem description

I am getting familiar with NLTK and text categorization through Jacob Perkins's book "Python Text Processing with NLTK 2.0 Cookbook".

Each document in my corpus is a single paragraph of text, so each one sits on its own line of a file rather than in a separate file. There are about 2 million such paragraphs/lines, and therefore about 2 million machine learning instances.

Each line in my file (a paragraph of text combining domain title, description, and keywords) is the subject of feature extraction (tokenization, etc.) that turns it into an instance for a machine learning algorithm.

I have two such files, one containing all the positives and the other all the negatives.

How can I load them into a CategorizedCorpusReader? Is it possible?

I tried other solutions before, such as scikit, and finally picked NLTK, hoping it would be an easier starting point for getting a result.

Recommended answer

Assume you have two files: file_pos.txt and file_neg.txt.

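from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# CategorizedCorpusReader is only a mixin; the plaintext variant reads the files.
# The category ('pos' or 'neg') is taken from the capture group in cat_pattern.
reader = CategorizedPlaintextCorpusReader('/path/to/corpora/',
                                          r'file_.*\.txt',
                                          cat_pattern=r'file_(\w+)\.txt')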

After this, you can apply the usual corpus functions to it, such as:

>>> reader.categories()
['neg', 'pos']
>>> reader.fileids(categories=['neg'])
['file_neg.txt']

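The reader's other access methods, such as words, sents, and raw, work the same way. Note that tagged_sents and tagged_words are available only on tagged readers such as CategorizedTaggedCorpusReader, which expect text that is already tagged.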
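Because the question stores one text per line, each file holds many instances, while the reader treats each file as a single document. One way to recover per-line instances is to split the raw text of each file on newlines and attach the file's category to every line. The following is a minimal sketch under stated assumptions, not code from the original answer: it assumes Python 3 and a recent NLTK, it uses CategorizedPlaintextCorpusReader, and the helper names labeled_lines and build_demo_reader as well as the sample texts are hypothetical.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader


def labeled_lines(reader):
    """Yield (tokens, category) for every non-empty line of every file."""
    for category in reader.categories():
        for fileid in reader.fileids(categories=[category]):
            # raw() loads the whole file into memory before it is split.
            for line in reader.raw(fileids=[fileid]).splitlines():
                if line.strip():
                    # Whitespace tokenization is a stand-in for a real tokenizer.
                    yield line.lower().split(), category


def build_demo_reader(root):
    """Write two tiny stand-in files and open them as a categorized corpus."""
    samples = {
        "file_pos.txt": "great product fast shipping\nlove this site\n",
        "file_neg.txt": "broken link spam page\n",
    }
    for name, text in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    return CategorizedPlaintextCorpusReader(
        root, r"file_.*\.txt", cat_pattern=r"file_(\w+)\.txt"
    )


with tempfile.TemporaryDirectory() as root:
    demo_reader = build_demo_reader(root)
    instances = list(labeled_lines(demo_reader))

print(instances)
```

Each yielded pair corresponds to one line of one file, so the number of instances equals the number of non-empty lines rather than the number of files.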

You might enjoy this tutorial about creating custom corpora: https://www.packtpub.com/books/content/python-text-processing-nltk-20-creating-custom-corpora
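With about 2 million lines per corpus, it may be preferable to stream each file line by line instead of loading it whole. The sketch below uses only the Python standard library, so it is an alternative to the corpus reader rather than part of NLTK. The names bag_of_words and stream_instances, the regex tokenizer, and the sample text are all assumptions made for illustration. It produces (feature dictionary, label) pairs, which is the shape NLTK classifiers such as NaiveBayesClassifier take as training data.

```python
import os
import re
import tempfile

TOKEN_RE = re.compile(r"\w+")


def bag_of_words(line):
    """Map each lowercase token in the line to True (presence features)."""
    return {token: True for token in TOKEN_RE.findall(line.lower())}


def stream_instances(path, label):
    """Yield (features, label) per non-empty line, reading one line at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield bag_of_words(line), label


with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "file_pos.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Example Domain, a short description\n\nkeywords here\n")
    streamed = list(stream_instances(path, "pos"))

print(streamed)
```

Calling stream_instances once per file with the matching label ("pos" or "neg") gives the full labeled data set without holding either file in memory at once.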
