nltk PlaintextCorpusReader sents and paras functions not working

Question

I cannot get the paras and sents function in the PlaintextCorpusReader to work. Here is the code I have:

import nltk
from nltk.corpus import PlaintextCorpusReader

corpus_root = './dir_root'
newcorpus = PlaintextCorpusReader(corpus_root, '.*') # Files you want to add

word_list = newcorpus.words('file1.txt')
sentence_list = newcorpus.sents('file1.txt')
paragraph_list = newcorpus.paras('file1.txt')

print(word_list)
print(sentence_list)
print(paragraph_list)

word_list comes out fine.

['__________________________________________________________________', 'Title', ...]

But, paragraph_list and sentence_list both give this error:

Traceback (most recent call last):
  File "corpus.py", line 13, in <module>
    print(sentence_list)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 225, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 129, in _read_sent_block
    for sent in self._sent_tokenizer.tokenize(para)])
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 956, in __getattr__
    self.__load()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 948, in __load
    resource = load(self._path)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 808, in load
    opened_resource = _open(resource_url)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 926, in _open
    return find(path_, path + ['']).open()
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 648, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource 'tokenizers/punkt/PY3/english.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/Users/username/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

I tried using nltk.download() to download the file into the corpus, but that did not work either. It also did not seem like the right approach, since the PlaintextCorpusReader already reads the files. The paras and sents functions are a part of the PlaintextCorpusReader. Is there a particular fileid I need to enter? Or is there some sort of regex argument it requires to find the sentences or paragraphs? The documentation and source code do not seem to say that these functions need anything more than the words function does.

Recommended answer

You're missing a data file ("resource") needed by the sentence tokenizer. Fix the problem by downloading the "punkt" resource under "Models" in the interactive downloader, or non-interactively by running this code once:

import nltk
nltk.download("punkt")
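
If you would rather not call the downloader on every run, a small guard can check for the resource first. This is only a minimal sketch: `ensure_nltk_resource` is a hypothetical helper name, not part of NLTK, and it assumes that `nltk.data.find` raises `LookupError` when a resource is missing (which matches the traceback in the question). The resource name your NLTK version wants is the one named in its own error message, so adjust the arguments if it differs from "punkt".

```python
def ensure_nltk_resource(resource_path, package, find=None, download=None):
    """Download `package` only if `resource_path` cannot be found.

    Hypothetical helper, not an NLTK function. `find` and `download`
    default to nltk.data.find and nltk.download; they are parameters only
    so the logic can be exercised without NLTK data or network access.
    Returns True if a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)   # raises LookupError if the resource is absent
        return False
    except LookupError:
        download(package)     # fetch the missing package once
        return True


# Typical call, placed before any use of sents() or paras():
# ensure_nltk_resource("tokenizers/punkt", "punkt")
```

This also explains why words() worked while sents() and paras() failed: the traceback shows the sentence tokenizer being loaded lazily from the data directory, and only those two functions need it.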

To avoid running into this kind of problem repeatedly as you explore NLTK, I recommend downloading the "book" bundle now. It contains everything you're likely to need for a while.
