Genia Tagger file not found error in Anaconda/NLTK

Question

I need to perform text pre-processing tasks such as sentence splitting, tokenization and tagging using NLTK, and I want to use the GENIA tagger for tagging. I am using Anaconda version 3.10 and installed geniatagger with the following command.

python setup.py install

In the IPython console, I entered the following code.

import geniatagger
tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
print tagger.parse('Welcome to natural language processing!')

Pressing Enter produces the following error message.

---------------------------------------------------------------------------
WindowsError                              Traceback (most recent call last)
<ipython-input-2-52e4d65c2d02> in <module>()
----> 1 tagger = geniatagger.GeniaTagger('C:\Users\dell\Anaconda\geniatagger\geniatagger')
  2 print tagger.parse('Welcome to natural language processing!')
  3 

 C:\Users\dell\Anaconda\lib\site-packages\geniatagger_python-0.1-py2.7.egg\geniatagger.pyc in __init__(self, path_to_tagger)
 19         self._tagger = subprocess.Popen('./'+os.path.basename(path_to_tagger),
 20                                         cwd=self._dir_to_tagger,
 ---> 21                                         stdin=subprocess.PIPE, stdout=subprocess.PIPE)
 22 
 23     def parse(self, text):

 C:\Users\dell\Anaconda\lib\subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
708                                 p2cread, p2cwrite,
709                                 c2pread, c2pwrite,
--> 710                                 errread, errwrite)
711         except Exception:
712             # Preserve original exception in case os.close raises.

C:\Users\dell\Anaconda\lib\subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
956                                          env,
957                                          cwd,
--> 958                                          startupinfo)
959             except pywintypes.error, e:
960                 # Translate pywintypes.error to WindowsError, which is

WindowsError: [Error 2] The system cannot find the file specified

Why do I get this error message? How can I fix it?

If I use this tagger straight away, will it perform the tokenization as well?

Note: the geniatagger Python file is inside the 'geniatagger' folder.

Recommended answer

TL;DR:

# Install Genia Tagger (C code).
$ git clone https://github.com/saffsd/geniatagger && cd geniatagger && make && cd ..
# Install Genia Tagger (python wrapper)
$ git clone https://github.com/informationsea/geniatagger-python.git && cd geniatagger-python && sudo python setup.py install && cd ..
$ python
>>> from geniatagger import GeniaTagger
>>> tagger = GeniaTagger('./geniatagger/geniatagger')
>>> loading morphdic...done.
loading pos_models................done.
loading chunk_models....done.
loading named_entity_models..done.

>>> print tagger.parse('This is a pen.')
[('This', 'This', 'DT', 'B-NP', 'O'), ('is', 'be', 'VBZ', 'B-VP', 'O'), ('a', 'a', 'DT', 'B-NP', 'O'), ('pen', 'pen', 'NN', 'I-NP', 'O'), ('.', '.', '.', 'O', 'O')]


I'm not sure whether the Genia tagger packages work out of the box with conda, so I think a native python/pip fix is simpler.

Firstly, there's no support for Genia Tagger in NLTK (at least not yet =) ), so this isn't a problem with the NLTK installation or its modules.

The problem might lie in an outdated include in the original GeniaTagger C++ source (http://www.nactem.ac.uk/tsujii/GENIA/tagger/).

To resolve it, you would have to add #include <cstdlib> to the original code, but thankfully @saffsd has already done so in his GitHub repo (https://github.com/saffsd/geniatagger/blob/master/morph.cpp).

Next comes installing the Python wrapper. You can either:

  • install from the official PyPI with: pip install https://pypi.python.org/packages/source/g/geniatagger-python/geniatagger-python-0.1.tar.gz
  • or install from some other GitHub repo, e.g. https://github.com/informationsea/geniatagger-python, which appears first in a Google search

Lastly, the GeniaTagger initialization in Python is rather odd: it takes the path to the tagger executable itself rather than to the tagger's directory, and it assumes that the model files are in the same directory as the executable; see https://github.com/informationsea/geniatagger-python/blob/master/geniatagger.py#L19 .
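
A quick pre-flight check can show whether the path handed to GeniaTagger actually points at a compiled executable before the wrapper tries to launch it. This is only a minimal sketch: check_tagger_path is a hypothetical helper, not part of geniatagger-python, and the executable-bit check is likely to be less meaningful on Windows.

```python
import os


def check_tagger_path(path_to_tagger):
    """Return a list of problems with the path (an empty list means none found).

    Hypothetical helper for illustration; it is not part of the wrapper.
    """
    problems = []
    if os.path.isdir(path_to_tagger):
        # The wrapper wants the executable, not the folder that contains it.
        problems.append("path points to a directory, not the tagger executable")
    elif not os.path.isfile(path_to_tagger):
        # Typical when the C++ code was never compiled with `make`.
        problems.append("no file exists at this path")
    elif not os.access(path_to_tagger, os.X_OK):
        problems.append("file exists but is not executable")
    return problems


for problem in check_tagger_path('./geniatagger/geniatagger'):
    print(problem)
```

If the list comes back empty, the path at least refers to an executable file; whether the model files sit next to it still has to be checked separately.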

It possibly also expects a leading './' on the path, so you would have to initialize the tagger as GeniaTagger('./geniatagger/geniatagger').
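
To see how the path argument might be taken apart, the sketch below mimics the Popen call visible in the traceback, which runs './' plus the file name with a working directory derived from the path. It assumes the working directory is simply os.path.dirname(path_to_tagger); that detail is not shown in the traceback, and split_tagger_path is a hypothetical name used only for illustration.

```python
import os


def split_tagger_path(path_to_tagger):
    """Return (working_dir, command) roughly as the wrapper appears to use them."""
    # Assumption: the working directory is the directory part of the path.
    working_dir = os.path.dirname(path_to_tagger)
    # From the traceback: the command is './' followed by the file name.
    command = './' + os.path.basename(path_to_tagger)
    return working_dir, command


print(split_tagger_path('./geniatagger/geniatagger'))
print(split_tagger_path('geniatagger'))
```

Under this assumption, a bare file name with no directory part yields an empty working directory, which may be one reason a path with an explicit leading directory behaves better.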

Beyond the installation issues: the Python wrapper's GeniaTagger object has only one method, parse(). It takes a single sentence string as input and returns a list of tuples, one per token. The items in each tuple are:

  • token (surface word)
  • lemma (see Stemmers vs Lemmatizers)
  • POS tag (looks like Penn Treebank tagset, see What are all possible pos tags of NLTK?)
  • Chunk tag, e.g. B-NP or B-VP in the output above (see Output results in conll format (POS-tagging, stanford pos tagger))
  • Named Entity chunk
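
The five fields listed above can be given names so that downstream code does not depend on tuple positions. The following is a small sketch, not part of the wrapper: TaggedToken and to_tagged_tokens are hypothetical names, and the sample data is the parse() output shown in the TL;DR, so no tagger installation is needed to run it.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the list of tuple items above.
TaggedToken = namedtuple('TaggedToken', ['token', 'lemma', 'pos', 'chunk', 'ner'])


def to_tagged_tokens(parsed):
    """Convert the list of 5-tuples returned by parse() into named records."""
    return [TaggedToken(*fields) for fields in parsed]


# Output of tagger.parse('This is a pen.') as shown in the answer.
sample = [('This', 'This', 'DT', 'B-NP', 'O'),
          ('is', 'be', 'VBZ', 'B-VP', 'O'),
          ('a', 'a', 'DT', 'B-NP', 'O'),
          ('pen', 'pen', 'NN', 'I-NP', 'O'),
          ('.', '.', '.', 'O', 'O')]

tokens = to_tagged_tokens(sample)

# Select the surface forms of all tokens whose POS tag starts with 'NN'.
nouns = [t.token for t in tokens if t.pos.startswith('NN')]
print(nouns)
```

With a working installation, the same conversion could be applied directly to the result of tagger.parse(sentence).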
