如何下载NLTK数据? [英] How do I download NLTK data?

查看:396
本文介绍了如何下载NLTK数据?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

更新后的答案:NLTK适用于2.7.我有3.2.我卸载了3.2,然后安装了2.7.现在可以了!!

Updated answer:NLTK works for 2.7 well. I had 3.2. I uninstalled 3.2 and installed 2.7. Now it works!!

我已经安装了NLTK并尝试下载NLTK数据.我所做的就是遵循此站点上的说明: http://www.nltk.org/data.html

I have installed NLTK and tried to download NLTK Data. What I did was to follow the instrution on this site: http://www.nltk.org/data.html

我下载了NLTK,安装了它,然后尝试运行以下代码:

I downloaded NLTK, installed it, and then tried to run the following code:

>>> import nltk
>>> nltk.download()

它给了我以下错误消息:

It gave me the error message like below:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    nltk.download()
AttributeError: 'module' object has no attribute 'download'
 Directory of C:\Python32\Lib\site-packages

尝试了nltk.download()nltk.downloader(),都给了我错误消息.

Tried both nltk.download() and nltk.downloader(), both gave me error messages.

然后我用help(nltk)取出包装,它显示以下信息:

Then I used help(nltk) to pull out the package, it shows the following info:

NAME
    nltk

PACKAGE CONTENTS
    align
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collocations
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    examples (package)
    featstruct
    grammar
    help
    inference (package)
    internals
    lazyimport
    metrics (package)
    misc (package)
    model (package)
    parse (package)
    probability
    sem (package)
    sourcedstring
    stem (package)
    tag (package)
    test (package)
    text
    tokenize (package)
    toolbox
    tree
    treetransforms
    util
    yamltags

FILE
    c:\python32\lib\site-packages\nltk

我确实在那儿看到了Downloader,不确定为什么它不起作用. Python 3.2.2,系统Windows Vista.

I do see Downloader there, not sure why it does not work. Python 3.2.2, system Windows vista.

推荐答案

TL; DR

要下载特定的数据集/模型,请使用nltk.download()函数,例如如果您要下载punkt句子标记器,请使用:

TL;DR

To download a particular dataset/models, use the nltk.download() function, e.g. if you are looking to download the punkt sentence tokenizer, use:

$ python3
>>> import nltk
>>> nltk.download('punkt')

如果不确定所需的数据/模型,可以使用以下数据和模型的基本列表开始:

If you're unsure of which data/model you need, you can start out with the basic list of data + models with:

>>> import nltk
>>> nltk.download('popular')

它将下载受欢迎"资源的列表,其中包括:

It will download a list of "popular" resources, these includes:

<collection id="popular" name="Popular packages">
      <item ref="cmudict" />
      <item ref="gazetteers" />
      <item ref="genesis" />
      <item ref="gutenberg" />
      <item ref="inaugural" />
      <item ref="movie_reviews" />
      <item ref="names" />
      <item ref="shakespeare" />
      <item ref="stopwords" />
      <item ref="treebank" />
      <item ref="twitter_samples" />
      <item ref="omw" />
      <item ref="wordnet" />
      <item ref="wordnet_ic" />
      <item ref="words" />
      <item ref="maxent_ne_chunker" />
      <item ref="punkt" />
      <item ref="snowball_data" />
      <item ref="averaged_perceptron_tagger" />
    </collection>


已编辑

万一有人从 https://stackoverflow.com/a/38135306从nltk下载较大的数据集的情况下避免出现错误, /610569


EDITED

In case anyone is avoiding errors from downloading larger datasets from nltk, from https://stackoverflow.com/a/38135306/610569

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index to treat panlex_lite as it's already installed.
>>> dler.download('popular')

已更新

从v3.2.5开始,当nltk_data找不到资源,例如:

Updated

From v3.2.5, NLTK has a more informative error message when nltk_data resource is not found, e.g.:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************

相关

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆