Stanford POS Tagger: How to preserve newlines in the output?


Problem description

My input.txt file contains the following sample text:

you have to let's
come and see me.

Now, if I invoke the Stanford POS tagger with the default command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt

I get the following in my output.txt file:

you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.

The problem with this output is that the original newline delimiter from the input file is lost: both input lines end up on a single output line.

If I instead use the following command to preserve the newline sentence delimiter in the output file, I have to set the -tokenize option to false:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt 

The problem with this command is that it breaks the tagging:

you_PRP have_VBP to_TO let's_NNS  
come_VB and_CC see_VB me._NN

Here let's and me. are each kept as a single token because tokenization is turned off, so they are tagged inappropriately (let's_NNS and me._NN).
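
To make the difference concrete, here is a minimal Java sketch that contrasts plain whitespace splitting (roughly what happens with -tokenize false) with a very simplified tokenizer applied line by line. The class name, method names, and regular expression are assumptions made up for this illustration; this is not the tokenizer the Stanford tools use, and it covers only the 's and trailing-punctuation cases from the sample input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewlineTokenizeSketch {
    // Hypothetical, simplified token pattern: a clitic such as 's,
    // a run of word characters, or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("'\\w+|\\w+|[^\\w\\s]");

    // Whitespace-only splitting: "let's" and "me." stay glued together.
    static List<String> whitespaceTokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Simplified tokenization: splits off the clitic and the final period.
    static List<String> simpleTokens(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Tokenizes each input line separately, so every input line
    // produces exactly one output line.
    static String tokenizePreservingLines(String text) {
        List<String> outLines = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            outLines.add(String.join(" ", simpleTokens(line)));
        }
        return String.join("\n", outLines);
    }

    public static void main(String[] args) {
        String input = "you have to let's\ncome and see me.";
        System.out.println(whitespaceTokens("you have to let's")); // [you, have, to, let's]
        System.out.println(tokenizePreservingLines(input));
        // you have to let 's
        // come and see me .
    }
}
```

The point of the sketch is only that line splitting and tokenization are independent steps: the line structure can be kept while tokens are still split properly inside each line.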

My question is: how can I preserve the newline delimiters in the output file without messing up the tokenization?

Recommended answer

The answer should have been to use the command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt 

But there was a bug: in version 3.1.3 (and perhaps all earlier versions) this did not work, because the newlines were ignored. It will work in version 3.1.4 and later.
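
If upgrading is not an option, one possible workaround (not part of the original answer, and only a sketch) is to feed the tagger one line at a time from your own code and write one output line per input line. The example below assumes a hypothetical tagSentence function that takes a raw sentence and returns its tagged form; with the Stanford library this might wrap a call such as MaxentTagger.tagString, which is an assumption you should check against your version. A stub tagger stands in here so that the example runs without the Stanford jar.

```java
import java.util.StringJoiner;
import java.util.function.UnaryOperator;

public class LineByLineTagging {
    // Applies tagSentence (a hypothetical per-sentence tagger) to every
    // non-empty line and joins the results with newlines, so the output
    // has the same line structure as the input.
    static String tagPerLine(String text, UnaryOperator<String> tagSentence) {
        StringJoiner out = new StringJoiner("\n");
        for (String line : text.split("\n", -1)) {
            out.add(line.trim().isEmpty() ? "" : tagSentence.apply(line));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stub tagger for illustration only: appends _X to every
        // whitespace-separated token. A real tagger would also tokenize.
        UnaryOperator<String> stubTagger = s -> s.trim().replaceAll("(\\S+)", "$1_X");
        System.out.println(tagPerLine("you have to let's\ncome and see me.", stubTagger));
    }
}
```

Because tokenization happens inside the per-sentence call, this approach keeps the line breaks without requiring -tokenize false.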

In the meantime, if the amount of text is small, you might try using the Stanford Parser, where the corresponding flag has a different name: -sentences newline.
