NLTK STEMMER:字符串索引超出范围 [英] nltk stemmer: string index out of range

查看:177
本文介绍了NLTK STEMMER:字符串索引超出范围的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组腌制的文本文档,我想使用nltk的PorterStemmer来阻止它们.出于特定于我项目的原因,我想在django应用程序视图中进行词根查找.

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside of a django app view.

但是,当在django视图中提取文档时,我从PorterStemmer().stem()收到了字符串'oed'IndexError: string index out of range异常.结果,运行以下命令:

However, when stemming the documents inside the django view, I receive an IndexError: string index out of range exception from PorterStemmer().stem() for the string 'oed'. As a result, running the following:

# xkcd_project/search/views.py
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')
    return render(request, 'list.html')

引发上述错误:

Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range

现在真正奇怪的是,在django之外的相同字符串上运行相同的词干提取器(无论是单独的python文件还是交互式python控制台)都不会产生错误.换句话说:

Now what is really odd is running the same stemmer on the same string outside django (be it a seperate python file or an interactive python console) produces no error. In other words:

# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')

其次:

python test.py
# successfully prints 'o'

什么原因导致了此问题?

what is causing this issue?

推荐答案

这是NLTK 3.2.2版特有的NLTK错误,对此我应该负责.它由PR https://github.com/nltk/nltk/pull/1261改写了波特词干.

This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

我写了一个修补程序,该修补程序在NLTK 3.2.3中发布.如果您使用的是3.2.2版,并且需要此修复程序,则只需升级-例如通过运行

I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running

pip install -U nltk

这篇关于NLTK STEMMER:字符串索引超出范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆