XML format in Stanford POS tagger


Question

I have tagged 20 sentences, and this is my code:

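import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class myTag {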

public static void main(String[] args) {

    Properties props = new Properties();

    try {
        props.load(new FileReader("D:/tagger/english-bidirectional-distsim.tagger.props"));
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    MaxentTagger tagger = new MaxentTagger("D:/tagger/english-bidirectional-distsim.tagger",props);

    //==================================================================================================
    try (BufferedReader br = new BufferedReader(new FileReader("C:/Users/chelsea/Desktop/EN/EN.txt")))
    {

        String sCurrentLine;

        while ((sCurrentLine = br.readLine()) != null) {

            String tagged = tagger.tagString(sCurrentLine);
            System.out.println(tagged);
        }

    } catch (IOException e) {
        e.printStackTrace();
    }

}

}

In the output, each sentence node has an id attribute, and it is constantly 0, which it should not be. I expect the values 0, 1, 2, 3, 4, ... I don't understand what is wrong with my code.

Recommended answer

The Stanford POS tagger (strictly speaking, the sentence splitter that is applied before the POS annotator) generates sentence ids per input text. Each time you ask the tagger to tag sCurrentLine, it receives a text consisting of one sentence. That text is split into sentences, of which there is just one, and it gets id = 0. In the next iteration you pass another text, the next sCurrentLine, which is again the only sentence in its input and therefore again the first sentence, with id = 0. And so on.

Thus, if you want the tagger to assign correct ids, first build the whole text and then pass it to the tagger in a single call. However, if your input text is already split into sentences, it is better to leave things as they are (and generate the ids yourself in the loop, if you need them).
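The second option, generating the ids yourself in the loop, could look roughly like the sketch below. It is only an illustration under some assumptions: the class name SentenceIdDemo and the helpers withSentenceId and tagLines are hypothetical, and it assumes the tagger's XML output opens each sentence with a tag of the form <sentence id="0">. The tagger is passed in as a plain function so that the sketch runs without the Stanford jar; main uses a fake stand-in tagger, not the real one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper class, not part of the Stanford tagger API.
class SentenceIdDemo {

    // Rewrites the id of the first <sentence id="..."> opening tag.
    // Assumption: the tagger output starts each sentence with such a tag.
    static String withSentenceId(String taggedXml, int id) {
        return taggedXml.replaceFirst(
                "<sentence id=\"\\d+\"",
                "<sentence id=\"" + id + "\"");
    }

    // Tags each non-empty line separately and numbers the results 0, 1, 2, ...
    static List<String> tagLines(List<String> lines, Function<String, String> tagger) {
        List<String> result = new ArrayList<>();
        int sentenceId = 0; // running counter kept outside the tagger
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines so ids stay consecutive
            }
            result.add(withSentenceId(tagger.apply(line), sentenceId));
            sentenceId++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the real tagger: always emits id="0", like one call per line does.
        Function<String, String> fakeTagger =
                line -> "<sentence id=\"0\">" + line + "</sentence>";

        List<String> lines = Arrays.asList("First sentence.", "Second sentence.");
        for (String tagged : tagLines(lines, fakeTagger)) {
            System.out.println(tagged);
        }
    }
}
```

With the real tagger from the question, the function argument would presumably be tagger::tagString, and the lines would come from the BufferedReader loop instead of a fixed list.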
