从复杂(混合)句子中提取简单句子的算法? [英] algorithm to extract simple sentences from complex(mixed) sentences?

查看:77
本文介绍了从复杂(混合)句子中提取简单句子的算法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有没有一种算法可用于从段落中提取简单的句子?

Is there an algorithm that can be used to extract simple sentences from paragraphs?

我的最终目标是以后对生成的简单句子运行另一种算法,以确定作者的情感.

My ultimate goal is to later run another algorithm on the resulted simple sentence to determine the author's sentiment.

我已经从Chae-Deug Park等来源进行了研究,但没有人讨论准备简单的句子作为训练数据.

I've researched this from sources such as Chae-Deug Park but none discuss preparing simple sentences as training data.

预先感谢

推荐答案

我刚刚使用了openNLP.

I have just used openNLP for the same.

public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException,
        InvalidFormatException {

    InputStream is = new FileInputStream("resources/models/en-sent.bin");
    SentenceModel model = new SentenceModel(is);
    SentenceDetectorME sdetector = new SentenceDetectorME(model);

    String[] sentDetect = sdetector.sentDetect(paragraph);
    is.close();
    return Arrays.asList(sentDetect);
}

示例

    //Failed at Hi.
    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone

    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.

    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

只有在人为错误时它才会失败.例如. 博士."缩写应为大写D,并且两个句子之间至少应有1个空格.

It failed only when there is a human mistake. Eg. "Dr." abbreviation should have capital D, and there is at least 1 space is expected between 2 sentences.

您还可以通过以下方式使用 RE 来实现它;

You can also achieve it using RE in following way;

public static List<String> breakIntoSentencesCustomRESplitter(String paragraph){
    List<String> sentences = new ArrayList<String>();
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;

}

示例

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr., mrs.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at U.S.
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

但是错误率很高.另一种方法是使用BreakIterator;

But errors are competitively high. Another way is using BreakIterator;

public static List<String> breakIntoSentencesBreakIterator(String paragraph){
    List<String> sentences = new ArrayList<String>();
    BreakIterator sentenceIterator =
            BreakIterator.getSentenceInstance(Locale.ENGLISH);
    BreakIterator sentenceInstance = sentenceIterator.getSentenceInstance();
    sentenceInstance.setText(paragraph);

    int end = sentenceInstance.last();
     for (int start = sentenceInstance.previous();
          start != BreakIterator.DONE;
          end = start, start = sentenceInstance.previous()) {
         sentences.add(paragraph.substring(start,end));
     }

     return sentences;
}

示例:

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));


    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

基准化:

  • 自定义RE:7毫秒
  • BreakIterator:143毫秒
  • openNlp:255毫秒

这篇关于从复杂(混合)句子中提取简单句子的算法?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆