How to extract noun phrases from the parsed text


Question

I have parsed a text with a constituency parser and copied the result into a text file like the one below:

(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went) (PP (TO to)....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (PRP I)) (VP (VBD was) (NP (NP (EX...
(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP I)) (VP (VBD went) (PP (TO to.....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (NNP Jim)) (VP (VBD was) (NP (NP (....
(ROOT (S (S (NP (PRP I)) (VP (VBD started) (S (VP (VBG talking) (PP.....

I need to extract all noun phrases (NP) from this text file. I wrote the following code, but it extracts only the first NP from each line, and I need all of them. My code is:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;

public class nounPhrase {

    // Returns the index of the ')' that closes the '(' at openPos.
    public static int findClosingParen(char[] text, int openPos) {
        int closePos = openPos;
        int counter = 1;
        while (counter > 0) {
            char c = text[++closePos];
            if (c == '(') {
                counter++;
            }
            else if (c == ')') {
                counter--;
            }
        }
        return closePos;
    }

    public static void main(String[] args) throws IOException {

        ArrayList<String> npList = new ArrayList<>();
        String line;
        String line1;
        int np;

        String Input = "/local/Input/Temp/Temp.txt";

        String Output = "/local/Output/Temp/Temp-out.txt";

        FileInputStream fis = new FileInputStream(Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis, "UTF-8"));
        while ((line = br.readLine()) != null) {
            int start = line.indexOf("(NP");
            if (start < 0) {
                continue; // no NP on this line
            }
            char[] lineArray = line.toCharArray();
            // Only the first "(NP" on the line is looked at here.
            np = findClosingParen(lineArray, start);
            line1 = line.substring(start, np + 1);
            System.out.print(line1 + "\n");
        }
        br.close();
    }
}

The output is:

(NP (NN Yesterday))...I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also
(NP (NNP Jim)).....I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also

My code takes only the first NP on each line, up to its closing parenthesis, but I need to extract all NPs from the text.

Recommended answer

While writing your own tree parser is a good exercise (!), if you just want results, the easiest way is to use more of the functionality of the Stanford NLP tools, namely Tregex, which is designed for exactly this kind of task. You can change your final while loop to something like this:

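// Needs these imports at the top of the file:
//   edu.stanford.nlp.trees.Tree
//   edu.stanford.nlp.trees.tregex.TregexPattern
//   edu.stanford.nlp.trees.tregex.TregexMatcher
TregexPattern tPattern = TregexPattern.compile("NP");
while ((line = br.readLine()) != null) {
    Tree t = Tree.valueOf(line);
    TregexMatcher tMatcher = tPattern.matcher(t);
    // find() advances to each NP node in turn, including nested ones
    while (tMatcher.find()) {
        System.out.println(tMatcher.getMatch());
    }
}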
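
If you would rather not add a dependency, the "write your own" route can be sketched in plain Java by repeating the question's parenthesis-matching idea for every "(NP " on the line instead of only the first. This is a minimal sketch, not part of the original answer: the class name NpExtractor, the method names, and the sample sentence are made up for illustration. It assumes the label is exactly NP (tags with suffixes such as NP-TMP would not match) and that literal parentheses in the text are escaped as -LRB- / -RRB-, so raw '(' and ')' only ever mark tree structure.

```java
import java.util.ArrayList;
import java.util.List;

class NpExtractor {

    // Returns the index of the ')' that closes the '(' at openPos,
    // or -1 if the brackets never balance (e.g. a truncated line).
    static int findClosingParen(String text, int openPos) {
        int depth = 0;
        for (int i = openPos; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    // Collects every balanced "(NP ...)" subtree, nested ones included,
    // in the order their opening parentheses appear in the line.
    static List<String> extractNounPhrases(String parse) {
        List<String> result = new ArrayList<>();
        int start = parse.indexOf("(NP ");
        while (start >= 0) {
            int end = findClosingParen(parse, start);
            if (end >= 0) {
                result.add(parse.substring(start, end + 1));
            }
            // Resume just past this "(" so NPs nested inside it are found too.
            start = parse.indexOf("(NP ", start + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical parse, used only to demonstrate the method.
        String parse = "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD saw) "
                + "(NP (NP (DT a) (NN dog)) (PP (IN in) (NP (DT the) (NN park))))) (. .)))";
        for (String np : extractNounPhrases(parse)) {
            System.out.println(np);
        }
    }
}
```

In the question's loop you would call extractNounPhrases(line) for each line read and print or store each result. Because every match is rescanned to find its closing parenthesis, this does more work than a single-pass parser on deeply nested trees, which is one reason a real tree library such as Tregex is the more robust choice.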
