nltk 'unknown url' error

Question

I am trying to run a Python script that uses NLTK tokenization internally. Here is the part of the script that initializes NLTK:

# Module-level imports the snippet relies on (assumed; they are not shown in the excerpt).
import os
from pkg_resources import resource_filename


class NLTKTagger:
    '''
    class that supplies part of speech tags using NLTK
    note: avoids the NLTK downloader (see __init__ method)
    '''
    def __init__(self):
        import nltk
        from nltk.tag import PerceptronTagger
        from nltk.tokenize import TreebankWordTokenizer
        tokenizer_fn = os.path.abspath(resource_filename('phrasemachine.data', 'punkt.english.pickle'))
        tagger_fn = os.path.abspath(resource_filename('phrasemachine.data', 'averaged_perceptron_tagger.pickle'))
        # Load the tagger (this is the call that fails in the traceback below)
        self.tagger = PerceptronTagger(load=False)
        self.tagger.load(tagger_fn)

        # note: nltk.word_tokenize calls the TreebankWordTokenizer, but uses the downloader.
        #       Calling the TreebankWordTokenizer like this allows skipping the downloader.
        #       It seems the TreebankWordTokenizer uses PTB tokenization = regexes. i.e. no downloads
        #       https://github.com/nltk/nltk/blob/develop/nltk/tokenize/treebank.py#L25
        self.tokenize = TreebankWordTokenizer().tokenize
        self.sent_detector = nltk.data.load(tokenizer_fn)

I get the following error:

Traceback (most recent call last):
  File "C:\Users\Uzair\Desktop\phrasemachine_test.py", line 3, in <module>
    phrasemachine.get_phrases(text)
  File "C:\Program Files\Python36-32\lib\site-packages\phrasemachine\phrasemachine.py", line 260, in get_phrases
    tagger = TAGGER_NAMES[tagger]()
  File "C:\Program Files\Python36-32\lib\site-packages\phrasemachine\phrasemachine.py", line 173, in get_stdeng_nltk_tagger
    tagger = NLTKTagger()
  File "C:\Program Files\Python36-32\lib\site-packages\phrasemachine\phrasemachine.py", line 140, in __init__
    self.tagger.load(tagger_fn)
  File "C:\Program Files\Python36-32\lib\site-packages\nltk\tag\perceptron.py", line 209, in load
    self.model.weights, self.tagdict, self.classes = load(loc)
  File "C:\Program Files\Python36-32\lib\site-packages\nltk\data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "C:\Program Files\Python36-32\lib\site-packages\nltk\data.py", line 924, in _open
    return urlopen(resource_url)
  File "C:\Program Files\Python36-32\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files\Python36-32\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "C:\Program Files\Python36-32\lib\urllib\request.py", line 549, in _open
    'unknown_open', req)
  File "C:\Program Files\Python36-32\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Program Files\Python36-32\lib\urllib\request.py", line 1388, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: c>

I am using Python 3.6 on Windows 7 with NLTK 3.2.1. I tried the solutions mentioned here and here, but none of them worked. Is there any other solution?

Recommended answer

The data loader is mistaking the C: prefix in your path for a protocol name like http:. I thought this had been fixed already... To avoid the problem, add the file:// protocol at the start of your path. E.g.,

self.tagger.load("file://"+tagger_fn)
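To see why the traceback ends in "unknown url type: c", here is a minimal sketch that reproduces the same colon-splitting with the standard library's `urlparse`. It is an illustration only, not NLTK's own parsing code, and the path is a hypothetical one modelled on the traceback.

```python
from urllib.parse import urlparse

# Hypothetical Windows path, modelled on the one in the traceback.
tagger_fn = r"C:\Users\Uzair\averaged_perceptron_tagger.pickle"

# A URL parser treats everything before the first colon as the scheme,
# so the drive letter is read as a protocol name.
bare_scheme = urlparse(tagger_fn).scheme

# With the prefix suggested in the answer, the scheme is "file" instead.
prefixed_scheme = urlparse("file://" + tagger_fn).scheme

print(bare_scheme, prefixed_scheme)  # prints: c file
```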

(There are better ways to structure your code, but that's up to you.)

Technically this is not a bug, since nltk.data.load() expects a URL, not a filesystem path. But really it ought to be fixed; it's not that hard to handle Windows paths...
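Since the loader expects a URL, one option is to convert paths in a single place before handing them over. The helper below is a hypothetical sketch, not part of NLTK or phrasemachine: the name `ensure_file_url` and the list of protocols it recognizes are assumptions made for illustration. It simply applies the `file://` prefix from the answer unless the string already names a protocol.

```python
# Protocol prefixes treated as "already a URL" (an assumed list, for illustration).
KNOWN_PROTOCOLS = ("file:", "nltk:", "http:", "https:")


def ensure_file_url(path):
    """Return path unchanged if it already names a protocol, else prefix file://."""
    if path.lower().startswith(KNOWN_PROTOCOLS):
        return path
    return "file://" + path


# A bare Windows path gets the prefix; an existing URL is left alone.
print(ensure_file_url(r"C:\data\averaged_perceptron_tagger.pickle"))
print(ensure_file_url("nltk:tokenizers/punkt/english.pickle"))
```

In the question's code this would be used as `self.tagger.load(ensure_file_url(tagger_fn))`, which is equivalent to the one-line fix above for a plain Windows path.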
