How do I download NLTK data?
Question
Updated answer: NLTK works well with Python 2.7. I had Python 3.2, so I uninstalled 3.2 and installed 2.7. Now it works!
I have installed NLTK and tried to download the NLTK data. I followed the instructions on this site: http://www.nltk.org/data.html
I downloaded NLTK, installed it, and then tried to run the following code:
>>> import nltk
>>> nltk.download()
It gave me the following error message:
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
nltk.download()
AttributeError: 'module' object has no attribute 'download'
Directory of C:\Python32\Lib\site-packages
I tried both nltk.download() and nltk.downloader(); both gave me error messages.
Then I used help(nltk) to inspect the package, and it shows the following info:
NAME
nltk
PACKAGE CONTENTS
align
app (package)
book
ccg (package)
chat (package)
chunk (package)
classify (package)
cluster (package)
collocations
corpus (package)
data
decorators
downloader
draw (package)
examples (package)
featstruct
grammar
help
inference (package)
internals
lazyimport
metrics (package)
misc (package)
model (package)
parse (package)
probability
sem (package)
sourcedstring
stem (package)
tag (package)
test (package)
text
tokenize (package)
toolbox
tree
treetransforms
util
yamltags
FILE
c:\python32\lib\site-packages\nltk
I do see downloader in that list, so I'm not sure why it does not work. I'm on Python 3.2.2, Windows Vista.
Recommended answer
TL;DR
To download a particular dataset or model, use the nltk.download() function. For example, to download the punkt sentence tokenizer, use:
$ python3
>>> import nltk
>>> nltk.download('punkt')
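If you would rather not trigger a download every time a script starts, one option is to check for the resource first and download only when it is missing. The following is a minimal sketch, not part of NLTK itself: `ensure_resource` is a hypothetical helper name, and the `find` and `download` parameters are my own addition so the logic can be exercised without network access. It relies on `nltk.data.find` raising `LookupError` for a missing resource, which matches the traceback shown later in this answer.

```python
def ensure_resource(resource_path, package_id, find=None, download=None):
    """Download package_id only if resource_path cannot be found locally.

    Returns True if a download was attempted, False if the resource
    was already present. (Hypothetical helper, not an NLTK function.)
    """
    if find is None or download is None:
        # Imported lazily so the helper can also be driven by stand-ins.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)      # raises LookupError when missing
        return False
    except LookupError:
        download(package_id)     # e.g. nltk.download('punkt')
        return True
```

With NLTK installed, calling `ensure_resource('tokenizers/punkt', 'punkt')` should behave like `nltk.download('punkt')` on the first run and do nothing on later runs.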
If you're unsure which data or models you need, you can start with the basic list of data and models:
>>> import nltk
>>> nltk.download('popular')
This downloads a list of "popular" resources, which includes:
<collection id="popular" name="Popular packages">
<item ref="cmudict" />
<item ref="gazetteers" />
<item ref="genesis" />
<item ref="gutenberg" />
<item ref="inaugural" />
<item ref="movie_reviews" />
<item ref="names" />
<item ref="shakespeare" />
<item ref="stopwords" />
<item ref="treebank" />
<item ref="twitter_samples" />
<item ref="omw" />
<item ref="wordnet" />
<item ref="wordnet_ic" />
<item ref="words" />
<item ref="maxent_ne_chunker" />
<item ref="punkt" />
<item ref="snowball_data" />
<item ref="averaged_perceptron_tagger" />
</collection>
EDITED
To avoid errors when downloading larger datasets from nltk, here is a workaround from https://stackoverflow.com/a/38135306/610569:
$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python
>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it's already installed.
>>> dler.download('popular')
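The workaround above touches private attributes (`_update_index`, `_status_cache`), which may change between NLTK versions. A different approach, sketched here under assumptions, is to request packages one at a time and record the ones that fail instead of letting one bad package stop the whole run. `download_each` is a hypothetical helper name. I am assuming that `nltk.download` returns a falsy value when a single package fails with its default settings, so treat this as a starting point rather than a guaranteed fix.

```python
def download_each(package_ids, download=None):
    """Try each package id separately and return the ids that failed.

    (Hypothetical helper; `download` defaults to nltk.download.)
    """
    if download is None:
        import nltk
        download = nltk.download
    failed = []
    for package_id in package_ids:
        try:
            ok = download(package_id)
        except Exception:
            # Treat any exception from the downloader as a failure.
            ok = False
        if not ok:
            failed.append(package_id)
    return failed
```

For example, `download_each(['punkt', 'stopwords', 'wordnet'])` would attempt all three and return a list of whichever ones could not be fetched.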
Updated
From v3.2.5, NLTK gives a more informative error message when an nltk_data resource is not found, e.g.:
>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
sentences = [text] if preserve_line else sent_tokenize(text, language)
File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
opened_resource = _open(resource_url)
File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
return find(path_, path + ['']).open()
File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
Searched in:
- '/Users/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
Related
- To find the nltk_data directory (auto-magically), see https://stackoverflow.com/a/36383314/610569
- To download nltk_data to a different path, see https://stackoverflow.com/a/48634212/610569
- To configure the nltk_data path (i.e. set a different path for NLTK to find nltk_data), see https://stackoverflow.com/a/22987374/610569
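For downloading to a different path and pointing NLTK at it, a rough sketch might look like the following. It uses the `download_dir` argument of `nltk.download` and the `nltk.data.path` list, both of which exist in NLTK as far as I know. The helper name `download_to` and the injectable `download` and `search_path` parameters are my own assumptions, added only to keep the example self-contained.

```python
import os


def download_to(package_id, target_dir, download=None, search_path=None):
    """Download package_id into target_dir and add that dir to the search path.

    (Hypothetical helper; defaults to nltk.download and nltk.data.path.)
    """
    if download is None or search_path is None:
        import nltk
        if download is None:
            download = nltk.download
        if search_path is None:
            search_path = nltk.data.path
    os.makedirs(target_dir, exist_ok=True)
    ok = download(package_id, download_dir=target_dir)
    # NLTK only searches the directories listed in nltk.data.path,
    # so register the custom directory if it is not there yet.
    if target_dir not in search_path:
        search_path.append(target_dir)
    return ok
```

A call such as `download_to('punkt', '/path/to/my_nltk_data')` (the path is a placeholder) would then store the data there and let later lookups in the same process find it.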