Java Command Fails in NLTK Stanford POS Tagger


Problem Description

I request your kind help and assistance in solving the "Java Command Fails" error, which is thrown whenever I try to tag an Arabic corpus of about 2 megabytes. I have searched the web and the Stanford POS tagger mailing list, but I did not find a solution. I read some posts on similar problems, and it was suggested that the memory runs out. I am not sure of that; I still have 19 GB of free memory. I tried every possible solution offered, but the same error keeps showing.

I have an average command of Python and a good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, NLTK alpha, and the Stanford POS tagger model for Arabic. This is my code:

import nltk
from nltk.tag.stanford import POSTagger

# Paths to the Arabic model and the Stanford POS tagger jar
arabic_postagger = POSTagger("/home/mohammed/postagger/models/arabic.tagger",
                             "/home/mohammed/postagger/stanford-postagger.jar",
                             encoding='utf-8')

print("Executing tag_corpus.py...\n")

# Import the corpus file
print("Importing data...\n")

with open("test.txt", 'r', encoding='utf-8') as corpus_file:
    text = corpus_file.read().strip()

print("Tagging the corpus. Please wait...\n")

tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))

If the corpus size is less than 1 MB (≈ 100,000 words), there is no error. But when I try to tag the 2 MB corpus, the following error message is shown:

Traceback (most recent call last):
  File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
    tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
  File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
    return self.batch_tag([tokens])[0]
  File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
    stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
    raise OSError('Java command failed!')
OSError: Java command failed!

I intend to tag 300 million words for my Ph.D. research project. If I keep tagging 100 thousand words at a time, I will have to repeat the task 3,000 times. It will kill me!
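Independently of the memory fix below, a large corpus can also be tagged in fixed-size batches so that no single call to the tagger has to hold millions of tokens at once. A minimal sketch (`batch_tokens` is a hypothetical helper, not part of NLTK; the batch size of 100,000 mirrors the size the asker found safe):

```python
# Hypothetical helper (not part of NLTK): split a token list into
# fixed-size batches so each tagger call stays within memory limits.
def batch_tokens(tokens, batch_size=100000):
    """Yield successive slices of at most batch_size tokens."""
    for start in range(0, len(tokens), batch_size):
        yield tokens[start:start + batch_size]

# Example with 250,000 dummy tokens -> batches of 100k, 100k, 50k
tokens = ["w"] * 250000
batches = list(batch_tokens(tokens))
print([len(b) for b in batches])  # [100000, 100000, 50000]
```

Each batch could then be passed to `arabic_postagger.tag(...)` and the results concatenated, instead of re-running the whole script thousands of times by hand.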

Your kind help is highly appreciated.

Recommended Answer

After your import lines, add this line:

nltk.internals.config_java(options='-Xmx2G')

This will increase the maximum RAM size that Java allows the Stanford POS Tagger to use. '-Xmx2G' changes the maximum allowable RAM to 2 GB instead of the default 512 MB. Note that JVM options are case-sensitive: the flag is '-Xmx', not '-xmx'.
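To make the difference between the two heap sizes concrete, here is a small hypothetical helper (not part of the JVM or NLTK) that converts an `-Xmx`-style size string into bytes:

```python
# Hypothetical helper: convert a JVM heap-size string ('2G', '512M',
# '1024K', or a bare byte count) into a number of bytes.
def heap_size_bytes(size):
    units = {'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}
    suffix = size[-1].upper()
    if suffix in units:
        return int(size[:-1]) * units[suffix]
    return int(size)  # a bare number means bytes

print(heap_size_bytes('512M'))  # 536870912
print(heap_size_bytes('2G'))    # 2147483648, four times the default
```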

See the Stack Overflow question "What are the Xms and Xmx parameters when starting JVM?" for more information.

If you're interested in how to debug your code, read on.

So we see that the command fails when handling a huge amount of data, so the first thing to look at is how Java is initialized in NLTK before the Stanford tagger is called, from https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19 :

from nltk.internals import find_file, find_jar, config_java, java, _java_options

We see that the nltk.internals package handles the different Java configurations and parameters.

Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and we see that no value is set for Java's memory allocation.
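The effect of `config_java` can be illustrated with a sketch of how a `java` command line gets assembled: without a memory option the JVM's default heap limit applies, and any configured options are inserted before the classpath. (`build_java_cmd` is an illustrative stand-in for this pattern, not NLTK's actual `nltk.internals.java` function.)

```python
# Illustrative stand-in (not NLTK's real internals): show where JVM
# options such as '-Xmx2G' land in the assembled command line.
def build_java_cmd(classpath, main_class, args, options=None):
    cmd = ['java']
    cmd += list(options or [])            # memory options go here
    cmd += ['-cp', classpath, main_class]  # then classpath and entry point
    cmd += list(args)                      # then program arguments
    return cmd

# Without options, no '-Xmx' flag is passed and the JVM default applies:
print(build_java_cmd('stanford-postagger.jar',
                     'edu.stanford.nlp.tagger.maxent.MaxentTagger',
                     ['-model', 'arabic.tagger']))
```

Passing `options=['-Xmx2G']` would place the larger heap request ahead of the classpath, which is what `nltk.internals.config_java(options='-Xmx2G')` arranges for every subsequent tagger call.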

