Stanford POS Tagger: How to preserve newlines in the output?

Views: 71 | Published: 2020/5/18 0:59:41 | Tags: java, text, nlp, stanford-nlp, pos-tagger


Question

My input.txt file contains the following sample text:

you have to let's
come and see me.

If I invoke the Stanford POS tagger with the default command:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile input.txt > output.txt

I get the following in my output.txt file:

you_PRP have_VBP to_TO let_VB 's_POS come_VB and_CC see_VB me_PRP ._.

The problem with this output is that the newline delimiters from the input file have been lost.

To preserve the newline sentence delimiter in the output file with the following command, I have to set the -tokenize option to false:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -tokenize false -textFile input.txt > output.txt

The problem with this command is that it completely messes up the output:

you_PRP have_VBP to_TO let's_NNS  
come_VB and_CC see_VB me._NN

Here, let's and me. are tagged incorrectly: each was left as a single token instead of being split (compare let_VB 's_POS and me_PRP ._. in the first output).
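To make the tokenization difference concrete, here is a minimal sketch in plain Java. It does not call the Stanford tagger at all. whitespaceTokens only approximates what the tagger appears to receive with -tokenize false, and toyTokens is a hypothetical stand-in for a real tokenizer that handles nothing beyond a trailing period and the clitic 's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Whitespace-only splitting: an approximation of the input the tagger
    // works with when tokenization is switched off.
    static List<String> whitespaceTokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Toy splitter (NOT the Stanford tokenizer): detaches a trailing period
    // and a trailing 's from each whitespace-separated token.
    static List<String> toyTokens(String line) {
        List<String> out = new ArrayList<>();
        for (String tok : whitespaceTokens(line)) {
            String trailing = null;
            if (tok.length() > 1 && tok.endsWith(".")) {
                trailing = ".";
                tok = tok.substring(0, tok.length() - 1);
            }
            if (tok.length() > 2 && tok.endsWith("'s")) {
                out.add(tok.substring(0, tok.length() - 2));
                out.add("'s");
            } else {
                out.add(tok);
            }
            if (trailing != null) {
                out.add(trailing);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] lines = {"you have to let's", "come and see me."};
        for (String line : lines) {
            System.out.println(whitespaceTokens(line) + " vs " + toyTokens(line));
        }
    }
}
```

With whitespace-only splitting, let's and me. each reach the tagger as one unknown word, which is consistent with the NNS and NN tags shown above.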

My question is: how can I preserve the newline delimiters in the output file without messing up the tokenization?

Recommended answer

The answer should have been to use this command, which sets the sentence delimiter to newline but leaves tokenization on:

java -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -sentenceDelimiter newline -textFile input.txt > output.txt

However, a bug in version 3.1.3 (and perhaps all earlier versions) caused the tagger to ignore the newlines, so this did not work. It will work in version 3.1.4+.

In the meantime, if the amount of text is small, you might try using the Stanford Parser, where the corresponding flag has a different name: -sentences newline.
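Another possible stopgap, not part of the answer above, is to tag the input one line at a time and rejoin the results with newlines yourself. The sketch below shows only that wrapping logic. The class LineByLine, the method tagPreservingNewlines, and the fakeTagger lambda are hypothetical names introduced for illustration; in real use the function argument would wrap whatever tagger call you have available.

```java
import java.util.function.Function;

public class LineByLine {
    // Applies tagLine to each input line separately and rejoins the results
    // with '\n', so the output has exactly one line per input line.
    static String tagPreservingNewlines(String text, Function<String, String> tagLine) {
        String[] lines = text.split("\n", -1); // -1 keeps trailing empty lines
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            String line = lines[i];
            // Blank lines are passed through without calling the tagger.
            out.append(line.trim().isEmpty() ? "" : tagLine.apply(line));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in tagger (an assumption for this demo): appends _X to every
        // whitespace-separated token instead of assigning a real tag.
        Function<String, String> fakeTagger =
                line -> line.trim().replaceAll("(\\S+)", "$1_X");
        String input = "you have to let's\ncome and see me.";
        System.out.println(tagPreservingNewlines(input, fakeTagger));
    }
}
```

Because each line is handed to the tagger on its own, the tagger's normal tokenization can stay enabled while the line structure is restored by the wrapper. The cost is one tagger call per line.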
