nltk PlaintextCorpusReader sents and paras functions not working
Question
I cannot get the paras and sents function in the PlaintextCorpusReader to work. Here is the code I have:
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = './dir_root'
newcorpus = PlaintextCorpusReader(corpus_root, '.*') # Files you want to add
word_list = newcorpus.words('file1.txt')
sentence_list = newcorpus.sents('file1.txt')
paragraph_list = newcorpus.paras('file1.txt')
print(word_list)
print(sentence_list)
print(paragraph_list)
word_list comes out fine.
['__________________________________________________________________', 'Title', ...]
But, paragraph_list and sentence_list both give this error:
Traceback (most recent call last):
File "corpus.py", line 13, in <module>
print(sentence_list)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/collections.py", line 225, in __repr__
for elt in self:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
tokens = self.read_block(self._stream)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/plaintext.py", line 129, in _read_sent_block
for sent in self._sent_tokenizer.tokenize(para)])
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 956, in __getattr__
self.__load()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 948, in __load
resource = load(self._path)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 808, in load
opened_resource = _open(resource_url)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 926, in _open
return find(path_, path + ['']).open()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 648, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'tokenizers/punkt/PY3/english.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/Users/username/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
I tried using nltk.download() to download the file into the corpus, but that did not work either. It also did not seem like the right approach, since the PlaintextCorpusReader should handle that already. The paras and sents functions are part of the PlaintextCorpusReader. Is there a particular fileid I need to enter? Or is there some sort of regex argument it requires to find the sentences or paragraphs? The documentation and source code do not seem to say they need anything more than the words function does.
Answer
You're missing a data file ("resource") needed by the sentence tokenizer. Fix the problem by downloading the "punkt" resource under "Models" in the interactive downloader, or non-interactively by running this code once:
nltk.download("punkt")
To avoid running into this kind of problem repeatedly as you explore the nltk, I recommend downloading the "book" bundle now. It contains everything you're likely to need for a while.
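For reference, a minimal sketch of the non-interactive fix end to end (this assumes nltk is installed and can reach the download server; the sample sentence is mine, not from the question):

import nltk

# One-time, non-interactive download of the Punkt sentence tokenizer data,
# the 'tokenizers/punkt/...' resource the traceback reports as missing.
nltk.download("punkt", quiet=True)

# With the resource in place, sentence tokenization works, and so will
# newcorpus.sents() and newcorpus.paras(), which use it internally.
from nltk.tokenize import sent_tokenize
sents = sent_tokenize("First sentence. Second sentence.")
print(sents)

After this runs once, the data lives under one of the searched nltk_data directories (e.g. ~/nltk_data), so subsequent scripts need no download step.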