(GATE) How to let Minipar play with special characters like Ö, Ü, Ä?


Problem description

While learning GATE, I encountered the following problem:

Minipar throws an exception when it sees uncommon characters like Ö, Ü, Ä.

For example, take the sentence "Batten disease (also known as Spielmeyer-Vogt-Sjögren-Batten disease) is a rare, fatal autosomal recessive neurodegenerative disorder that begins in childhood." (from a wiki article). The annotation Minipar produced before it stopped working is "Batten disease (also known as Spielmeyer-Vogt-Sj", which ends exactly before the character ö. This makes me guess that such characters need attention when using GATE, because the same pipeline processed several other articles without any trouble.

In the Messages tab, it reports:


gate.util.InvalidOffsetException
    at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
    at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
    at minipar.Minipar.runMinipar(Minipar.java:419)
    at minipar.Minipar.execute(Minipar.java:527)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)
gate.creole.ExecutionException: gate.util.InvalidOffsetException
    at minipar.Minipar.runMinipar(Minipar.java:491)
    at minipar.Minipar.execute(Minipar.java:527)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)
Caused by: gate.util.InvalidOffsetException
    at gate.annotation.AnnotationSetImpl.getNodes(AnnotationSetImpl.java:773)
    at gate.annotation.AnnotationSetImpl.add(AnnotationSetImpl.java:802)
    at minipar.Minipar.runMinipar(Minipar.java:419)
    ... 9 more
gate.creole.ExecutionException: Document doesn't have sentence annotations. please run tokenizer, sentence splitter and then Minipar
    at minipar.Minipar.saveGateSentences(Minipar.java:194)
    at minipar.Minipar.execute(Minipar.java:525)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.creole.ConditionalSerialController.runComponent(ConditionalSerialController.java:154)
    at gate.creole.SerialController.executeImpl(SerialController.java:153)
    at gate.creole.ConditionalSerialAnalyserController.executeImpl(ConditionalSerialAnalyserController.java:129)
    at gate.creole.AbstractController.execute(AbstractController.java:75)
    at gate.util.Benchmark.executeWithBenchmarking(Benchmark.java:291)
    at gate.gui.SerialControllerEditor$RunAction$1.run(SerialControllerEditor.java:1619)
    at java.lang.Thread.run(Unknown Source)

I'd like to thank Ian for his warm support once again.

Matt

Recommended answer

This appears to be an encoding-related issue of some sort, but unfortunately I can't do any debugging myself, as the Minipar parser binary no longer appears to be available from the usual download page: I get a small (less than 2 kB) greyscale JPEG image instead of a multi-MB .tgz.

There are a few things you could try, off the top of my head. The GATE Minipar wrapper writes the input file for the parser and reads the parser's output using whatever the default encoding is on the system where you're running. My speculation is that the parser is producing its output in a different encoding (possibly related to the encoding of the original training data?).
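
To illustrate why an encoding mismatch could plausibly end in an InvalidOffsetException, here is a minimal Java sketch. It does not use GATE or Minipar at all; the class and method names are made up for this example, and the choice of UTF-8 versus ISO-8859-1 is only an assumption about which two encodings might be involved. It shows that text written in one encoding and read back in another can gain or lose characters, so that character offsets computed from the misread text no longer line up with the original document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of GATE or the Minipar wrapper.
class EncodingOffsetDemo {

    // Encode text with one charset and decode the bytes with another,
    // as happens when a writer and a reader disagree about the encoding.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    // Index of the first position where the two strings differ, or -1 if equal.
    static int firstDifference(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : n;
    }

    public static void main(String[] args) {
        // \u00f6 is the character ö, escaped so the source file encoding does not matter.
        String sentence = "Spielmeyer-Vogt-Sj\u00f6gren-Batten disease";
        String misread = roundTrip(sentence,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("original length: " + sentence.length());
        System.out.println("misread length:  " + misread.length());
        System.out.println("first difference at index: "
                + firstDifference(sentence, misread));
    }
}
```

In this sketch the misread string is one character longer than the original, and the two strings first diverge at the position of the ö, which matches where the annotation in the question was cut off. That is consistent with the speculation above but does not prove it.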

The GATE wrapper writes its input to a temporary file, which you should be able to find in your temporary directory as long as you leave GATE Developer running in the background (the temp files are deleted when Developer exits). I would try running minipar-windows.exe on that file from the command line and seeing what the output looks like:

C:\path\to\minipar-windows.exe -p C:\path\to\minipar\data -file GATESentencesNNNNNN.txt
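
If the output is hard to judge by eye in a console window, one option is to redirect it to a file and look at the raw bytes. The following Java sketch is a hypothetical helper (the class name and the idea of redirecting the parser's output to a file are assumptions, not something the wrapper does for you). It prints the hexadecimal value of every non-ASCII byte in a file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of GATE or Minipar.
class NonAsciiByteDump {

    // Return the non-ASCII bytes (values 0x80 and above) as space-separated hex.
    static String hexOfNonAscii(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            int value = b & 0xFF; // treat the byte as unsigned
            if (value >= 0x80) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", value));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java NonAsciiByteDump.java <file>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hexOfNonAscii(data));
    }
}
```

For the character ö, the byte pair C3 B6 would point to UTF-8, while a single F6 would point to ISO-8859-1 or windows-1252. Running the same helper on the GATESentencesNNNNNN.txt input file shows which encoding the wrapper used when writing it.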

The output may give you a clue as to what's failing. If it looks right and you can determine the encoding it's trying to use, you could set your GATE Developer to use that as its default encoding (if you're using gate.exe to start it, then you do this by adding a line -Dfile.encoding=ISO-8859-1, or whatever the encoding is, to gate.l4j.ini) and see if that helps. If so, we can consider adding a parameter to the PR (processing resource) to specify the encoding to use when exchanging data with the parser executable.
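
To see which default encoding a JVM on your machine picks up, with or without the -Dfile.encoding flag, a small check like the following may help. This is only a sketch with a made-up class name. It runs in its own JVM rather than inside GATE Developer, so it tells you about GATE's setting only if you start it with the same flag and the same Java installation.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of GATE.
class DefaultEncodingCheck {

    // Summarise the file.encoding system property and the charset the JVM actually uses.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

For example, running `java -Dfile.encoding=ISO-8859-1 DefaultEncodingCheck.java` and comparing the result with a run without the flag shows whether the setting takes effect on your Java version.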
