Stanford POS Tagger: How to preserve newlines in the output?
Question
My input.txt file contains the following sample text:
you have to let's
come and see me.
Now if I invoke the Stanford POS tagger with the default command:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt
I get the following in my output.txt file:
you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.
The problem with the above output is that I have lost my original newline delimiter used in the input file.
Now, if I use the following command to preserve the newline sentence delimiter in the output file, I have to set the -tokenize option to false:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt
The problem with this command is that it completely messes up the output:
you_PRP have_VBP to_TO let's_NNS
come_VB and_CC see_VB me._NN
Here, let's and me. are tagged incorrectly: each is treated as a single token instead of being split.
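To make the cause concrete, here is a small, self-contained sketch comparing whitespace-only splitting (roughly what the tagger is left with when -tokenize is false) against a very crude Penn Treebank-like split. The class and method names are hypothetical, and the regular expression is only an illustration, not the tagger's actual tokenizer (it would, for example, split "don't" differently from the real one).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Stanford tools.
class TokenizeDemo {
    // Whitespace-only splitting: clitics and punctuation stay glued to words.
    static List<String> whitespaceTokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Crude approximation: an apostrophe clitic ('s), a run of word
    // characters, or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("'\\w+|\\w+|[^\\w\\s]");

    static List<String> roughPtbTokens(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String line : new String[] {"you have to let's", "come and see me."}) {
            System.out.println(whitespaceTokens(line) + " vs " + roughPtbTokens(line));
        }
    }
}
```

With whitespace splitting the tagger has to assign one tag to let's and one to me., which is why the tags come out wrong.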
My question is how can I preserve the newline delimiters in the output file without messing up the tokenization?
Recommended answer
The answer should have been to use the command:
java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt
But there was a bug and it didn't work (ignored the newlines) in version 3.1.3 (and perhaps all earlier versions). It will work in version 3.1.4+.
In the meantime, if the amount of text is small, you might try using the Stanford Parser (where the corresponding flag is named differently, so it's -sentences newline).
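Another possible stopgap, not part of the original answer, is to tag the input one line at a time and write one output line per input line. The sketch below is a hedged illustration only: LineByLineTagger and tagPreservingNewlines are hypothetical names, and the tagging step is passed in as a function so the example runs without the Stanford jar. With the library on the classpath, one would presumably pass something like the tagString method of a MaxentTagger instance; that wiring is an assumption and has not been checked against version 3.1.3.

```java
import java.util.function.UnaryOperator;

// Hypothetical helper; not part of the Stanford tools.
class LineByLineTagger {
    // Applies tagLine to each non-blank input line and emits exactly one
    // output line per input line, so the original line breaks survive.
    static String tagPreservingNewlines(String text, UnaryOperator<String> tagLine) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                out.append(tagLine.apply(line).trim());
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in tagger that appends a fake _X tag to every
        // whitespace-separated token. A real tagger (assumption:
        // something like tagger::tagString) would go here instead.
        UnaryOperator<String> fakeTagger =
                line -> line.trim().replaceAll("(\\S+)", "$1_X");
        String input = "you have to let's\ncome and see me.\n";
        System.out.print(tagPreservingNewlines(input, fakeTagger));
    }
}
```

Because each line is handed to the tagger with tokenization left on, words such as let's would still be split properly, at the cost of one tagging call per line.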