How to build a POS-tagged corpus with NLTK?

Views: 243 | Posted: 2020/5/18 1:02:22 | Tags: python nlp nltk pos-tagger tagged-corpus


Question

I am trying to build a POS-tagged corpus from external .txt files for chunking and for entity and relation extraction. So far I have only found a cumbersome multistep solution:

Read the files into a plain-text corpus:

from nltk.corpus.reader import PlaintextCorpusReader
my_corp = PlaintextCorpusReader(".", r".*\.txt")

Tag the corpus with the built-in Penn POS tagger:

my_tagged_corp= nltk.batch_pos_tag(my_corp.sents())

(By the way, at this point Python threw an error: NameError: name 'batch' is not defined)

Write the tagged sentences out to a file:

with open("output.txt", "w") as taggedfile:
    for sent in my_tagged_corp:
        # One sentence per line, each token written as word/tag
        line = " ".join(w + "/" + t for (w, t) in sent)
        taggedfile.write(line + "\n")
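The word/tag line format used in this step can be illustrated without NLTK. The following is a minimal sketch only; `format_tagged_sent` and `parse_tagged_line` are hypothetical helper names introduced here, not NLTK functions. NLTK's own `nltk.tag.str2tuple` does a similar split, and `TaggedCorpusReader` uses "/" as its default separator.

```python
# Minimal, NLTK-free sketch of the "word/tag" line format.
# Both helper names below are hypothetical, chosen for illustration.

def format_tagged_sent(sent, sep="/"):
    """Join (word, tag) pairs into one 'word/tag word/tag ...' line."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in sent)


def parse_tagged_line(line, sep="/"):
    """Split a line back into (word, tag) pairs.

    Splitting on the LAST separator keeps words that themselves
    contain the separator (for example '1/2') intact.
    """
    return [tuple(token.rsplit(sep, 1)) for token in line.split()]


tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]
line = format_tagged_sent(tagged)
print(line)                               # The/DT cat/NN sat/VBD ./.
print(parse_tagged_line(line) == tagged)  # True
```

The round trip only works if no word contains whitespace, which is why the tokenized sentences are written one per line with single spaces between tokens.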

And finally, read this output back in as a tagged corpus:

from nltk.corpus.reader import TaggedCorpusReader
my_corpus2 = TaggedCorpusReader(".",r"output.txt")

That is all very inconvenient for quite a common task (chunking always requires a tagged corpus). My question is: is there a more compact and elegant way to implement this? For instance, a corpus reader that takes raw input files and a tagger at the same time?

Recommended answer

I got a working solution for this. Kindly refer to the link for the step-by-step procedure.

Download the necessary files from here.

Once you follow the commands from 1, a pickle file will be generated, and this is your tagged corpus.

Once the pickle file is generated, you can check whether your tagger is working by running the following piece of code:

import nltk.data
tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
tagger.tag(['some', 'words', 'in', 'a', 'sentence'])
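As for the original question of a reader that takes raw files and a tagger together, one possible approach is to subclass `PlaintextCorpusReader` and tag sentences on demand. The sketch below is an illustration under assumptions, not an NLTK feature: `TaggingCorpusReader` is a hypothetical class name, and the demo uses `DefaultTagger` and `LineTokenizer` as stand-ins so that it runs without any downloaded models. Note also that in NLTK 3 the function `nltk.batch_pos_tag` was renamed to `nltk.pos_tag_sents`.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader
from nltk.tag import DefaultTagger
from nltk.tokenize import LineTokenizer


class TaggingCorpusReader(PlaintextCorpusReader):
    """Hypothetical reader (not part of NLTK): plain-text files plus a tagger."""

    def __init__(self, root, fileids, tagger, **kwargs):
        super().__init__(root, fileids, **kwargs)
        # Any object with a .tag(list_of_tokens) method should work here.
        self._tagger = tagger

    def tagged_sents(self, fileids=None):
        # Tag each tokenized sentence in memory; nothing is written to disk.
        return [self._tagger.tag(sent) for sent in self.sents(fileids)]


# Demo with stand-ins that need no downloaded data:
# DefaultTagger assigns the same tag to every token, and
# LineTokenizer treats each line of the file as one sentence.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "demo.txt"), "w", encoding="utf8") as f:
        f.write("The cat sat.\nDogs bark.\n")

    corpus = TaggingCorpusReader(
        root,
        r".*\.txt",
        tagger=DefaultTagger("NN"),
        sent_tokenizer=LineTokenizer(),
    )
    tagged = corpus.tagged_sents()

print(tagged[0])
```

For real use, the stand-in tagger could be replaced with the tagger loaded from the pickle file above, or with `nltk.tag.PerceptronTagger()`, which needs its model data downloaded first. The result of `tagged_sents()` is a list of sentences, each a list of (word, tag) pairs, which is the shape a chunker expects as input.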
