nltk doesn't add $NLTK_DATA to search path?
Problem Description
Under Linux, I have set the env var $NLTK_DATA (to '/home/user/data/nltk'), and the test below works as expected:
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
But when running another Python script, I got:
LookupError:
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/user/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
As we can see, nltk doesn't add $NLTK_DATA to its search path. After appending the NLTK_DATA dir manually:

nltk.data.path.append("/NLTK_DATA_DIR")

the script runs as expected. The question is: how do I make nltk add $NLTK_DATA to its search path automatically?
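Newer NLTK releases reportedly read $NLTK_DATA themselves at import time; if your version doesn't, or the variable isn't visible in the environment the script runs under, a small stdlib-only shim can splice it in at startup. The helper below is a hypothetical sketch, not part of NLTK's API; it treats $NLTK_DATA like PATH (os.pathsep-separated) and puts its entries first:

```python
import os

# Hypothetical helper (not NLTK API): splice the directories from
# $NLTK_DATA (os.pathsep-separated, like PATH) onto a search path,
# ahead of the defaults, without duplicating entries.
def paths_from_env(search_path, env=os.environ):
    extra = [p for p in env.get('NLTK_DATA', '').split(os.pathsep) if p]
    return extra + [p for p in search_path if p not in extra]

# Demo with a fake environment, so nothing real is touched:
defaults = ['/usr/share/nltk_data', '/usr/local/share/nltk_data']
print(paths_from_env(defaults, env={'NLTK_DATA': '/home/user/data/nltk'}))
# In a real script you might do: nltk.data.path[:] = paths_from_env(nltk.data.path)
```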
Solution

If you don't want to set $NLTK_DATA before running your scripts, you can do it within the Python scripts with:
import nltk
nltk.path.append('/home/alvas/some_path/nltk_data/')
E.g. let's move the nltk_data directory to a non-standard path that NLTK won't find automatically:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mkdir some_path
alvas@ubi:~$ mv nltk_data/ some_path/
alvas@ubi:~$ ls nltk_data/
ls: cannot access nltk_data/: No such file or directory
alvas@ubi:~$ ls some_path/nltk_data/
chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
Now, we use the nltk.path.append() hack:

alvas@ubi:~$ python
>>> import os
>>> import nltk
>>> nltk.path.append('/home/alvas/some_path/nltk_data/')
>>> nltk.pos_tag('this is a foo bar'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
>>> nltk.data
<module 'nltk.data' from '/usr/local/lib/python2.7/dist-packages/nltk/data.pyc'>
>>> nltk.data.path
['/home/alvas/some_path/nltk_data/', '/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
>>> exit()
Let's move it back and see whether it works:
alvas@ubi:~$ ls nltk_data
ls: cannot access nltk_data: No such file or directory
alvas@ubi:~$ mv some_path/nltk_data/ .
alvas@ubi:~$ python
>>> import nltk
>>> nltk.data.path
['/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
>>> nltk.pos_tag('this is a foo bar'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
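The ordering of nltk.data.path matters because lookups return the first match found. As a minimal sketch of that search behaviour (assuming a first-match resolver like NLTK's, not its actual code; the demo builds a fake resource in temp dirs rather than touching real data):

```python
import os
import tempfile

def first_hit(resource, search_path):
    # Mimics consulting a path list in order, the way nltk.data.path
    # is used: the first directory containing the resource wins.
    for d in search_path:
        candidate = os.path.join(d, resource)
        if os.path.exists(candidate):
            return candidate
    raise LookupError('Resource %r not found' % resource)

# Demo: two temp dirs; only the second holds the (fake) resource.
empty_dir, data_dir = tempfile.mkdtemp(), tempfile.mkdtemp()
os.makedirs(os.path.join(data_dir, 'tokenizers'))
open(os.path.join(data_dir, 'tokenizers', 'punkt.txt'), 'w').close()
print(first_hit('tokenizers/punkt.txt', [empty_dir, data_dir]))
```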
If you really really want to find nltk_data automagically, use something like:
import scandir
import os, sys
import time
import nltk

def find(name, path):
    for root, dirs, files in scandir.walk(path):
        if root.endswith(name):
            return root

def find_nltk_data():
    start = time.time()
    path_to_nltk_data = find('nltk_data', '/')
    print >> sys.stderr, 'Finding nltk_data took', time.time() - start
    print >> sys.stderr, 'nltk_data at', path_to_nltk_data
    with open('where_is_nltk_data.txt', 'w') as fout:
        fout.write(path_to_nltk_data)
    return path_to_nltk_data

def magically_find_nltk_data():
    if os.path.exists('where_is_nltk_data.txt'):
        with open('where_is_nltk_data.txt') as fin:
            path_to_nltk_data = fin.read().strip()
        if os.path.exists(path_to_nltk_data):
            nltk.data.path.append(path_to_nltk_data)
        else:
            nltk.data.path.append(find_nltk_data())
    else:
        path_to_nltk_data = find_nltk_data()
        nltk.data.path.append(path_to_nltk_data)

magically_find_nltk_data()
print nltk.pos_tag('this is a foo bar'.split())
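On Python 3.5+, os.walk already uses scandir internally, so the third-party scandir module isn't needed. A Python 3 sketch of the same cache-then-scan idea follows; the helper names (find_dir, cached_find) are made up for illustration, and the demo scans a throwaway temp tree instead of '/':

```python
import os
import tempfile

def find_dir(name, root):
    """Walk `root` and return the first directory whose path ends with `name`."""
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath.endswith(name):
            return dirpath
    return None

def cached_find(name, root, cache_file):
    """Return the cached location if still valid, else re-scan and re-cache."""
    if os.path.exists(cache_file):
        with open(cache_file) as fin:
            cached = fin.read().strip()
        if os.path.isdir(cached):       # cache hit and the path still exists
            return cached
    found = find_dir(name, root)        # cache miss or stale entry: re-scan
    if found:
        with open(cache_file, 'w') as fout:
            fout.write(found)
    return found

# Self-contained demo in a temp tree (stands in for '/' and nltk_data):
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'some_path', 'nltk_data'))
cache = os.path.join(base, 'where_is_nltk_data.txt')
print(cached_find('nltk_data', base, cache))   # first call scans the tree
print(cached_find('nltk_data', base, cache))   # second call reads the cache
```

In a real script you would then append the result to nltk.data.path, exactly as the answer above does.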
Let's call that Python script test.py:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
alvas@ubi:~$ python test.py
Finding nltk_data took 4.27330780029
nltk_data at /home/alvas/nltk_data
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
alvas@ubi:~$ mv nltk_data/ some_path/
alvas@ubi:~$ python test.py
Finding nltk_data took 4.75850391388
nltk_data at /home/alvas/some_path/nltk_data
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]