Split paragraph into sentences with titles and numbers

Question

I'm using the BreakIterator class in Java to break a paragraph into sentences. This is my code:

public Map<String, Double> breakSentence(String document) {
    // Declared locally so the snippet compiles on its own.
    Map<String, Double> sentences = new HashMap<String, Double>();
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(document);

    Double tfIdf = 0.0;
    int start = bi.first();
    for(int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
        String sentence = document.substring(start, end);

        sentences.put(sentence, tfIdf);
    }

    return sentences;
}

The problem arises when the paragraph contains titles or numbers, for example:

"Prof. Roberts trying to solve a problem by writing a 1.200 lines of code."

What my code produces is:

sentences :
Prof
Roberts trying to solve a problem by writing a 1
200 lines of code

instead of one single sentence, because of the periods in the title and in the number.

Is there a way to fix this to handle titles and numbers with Java?
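Before changing anything, it may help to see exactly where BreakIterator places its boundaries for your text. The following is only a minimal diagnostic sketch; the class name RawBoundaries and the method rawChunks are made up for illustration. It prints each raw chunk in brackets so that trailing whitespace is visible:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class RawBoundaries {

    // Returns every chunk between two consecutive sentence boundaries, unmodified.
    // (Hypothetical helper, not part of the original question.)
    public static List<String> rawChunks(String text) {
        List<String> chunks = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "Prof. Roberts trying to solve a problem by writing a 1.200 lines of code.";
        for (String chunk : rawChunks(text)) {
            // Brackets make the trailing whitespace of each chunk visible.
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Whatever the boundaries turn out to be, the chunks always cover the input without gaps, so any fix only has to decide which neighbouring chunks to glue back together.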

Recommended answer

Well, this is a bit of a tricky situation, and I've come up with a sticky solution, but it works nevertheless. I'm new to Java myself, so if a seasoned veteran wants to edit this or comment on it to make it more professional, by all means, please make me look better.

I basically added some control measures to what you already have: each chunk is checked for words like Dr., Prof., Mr., Mrs., etc. If one of those words is present, the code skips over that break and moves on to the next one (keeping the original start position), looking for the NEXT end (preferably one that doesn't come right after another Dr. or Mr., etc.).

I'm including my complete program so you can see it all:

import java.text.BreakIterator;
import java.util.*;

public class TestCode {

    private static final String[] ABBREVIATIONS = {
        "Dr." , "Prof." , "Mr." , "Mrs." , "Ms." , "Jr." , "Ph.D."
    };

    public static void main(String[] args) throws Exception {

        String text = "Prof. Roberts and Dr. Andrews trying to solve a " +
                      "problem by writing a 1.200 lines of code. This will " +
                      "work if Mr. Java writes solid code.";

        for (String s : breakSentence(text)) {
              System.out.println(s);
        }
    }

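    public static List<String> breakSentence(String document) {

        List<String> sentenceList = new ArrayList<String>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);
        int start = bi.first();    // start of the current chunk
        int end = bi.next();       // end of the current chunk
        int tempStart = start;     // start of the sentence being assembled
        while (end != BreakIterator.DONE) {
            String sentence = document.substring(start, end);
            // Only emit a sentence when the chunk contains no abbreviation;
            // otherwise leave tempStart alone so the next chunk is joined on.
            if (! hasAbbreviation(sentence)) {
                sentence = document.substring(tempStart, end);
                tempStart = end;
                sentenceList.add(sentence);
            }
            start = end;
            end = bi.next();
        }
        // Keep any trailing text that was held back because the last chunk
        // contained an abbreviation (e.g. a document ending in "Jr.").
        if (tempStart < document.length()) {
            sentenceList.add(document.substring(tempStart));
        }
        return sentenceList;
    }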

    private static boolean hasAbbreviation(String sentence) {
        if (sentence == null || sentence.isEmpty()) {
            return false;
        }
        for (String w : ABBREVIATIONS) {
            if (sentence.contains(w)) {
                return true;
            }
        }
        return false;
    }
}

What this does is basically set up two starting points. The original starting point (the one you used) still does the same thing, but tempStart doesn't move unless the string looks ready to be made into a sentence. It takes the first chunk:

"Prof."

and checks whether the break was caused by a weird word (i.e. does the chunk contain Prof., Dr., or whatever else might have caused that break). If it does, then tempStart doesn't move; it stays there and waits for the next chunk to come back. In my slightly more elaborate sentence, the next chunk also has a weird word messing up the breaks:

"Roberts and Dr."

It takes that chunk and, because it has a Dr. in it, continues on to the third chunk of the sentence:

"Andrews trying to solve a problem by writing a 1.200 lines of code."

Once it reaches the third chunk, which has no weird titles that could have caused a false break, it takes the substring from tempStart (which is still at the beginning) to the current end, basically joining all three parts together.

Now it sets the temp start to the current 'end' and continues.
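One caveat with checking whether a chunk merely contains an abbreviation: a complete sentence that happens to mention "Mr." somewhere in the middle would also be held back and glued onto the next one. A possible refinement, sketched below under my own assumptions, is to hold a chunk back only when it ends with an abbreviation, or when the boundary falls between a digit-plus-period and another digit (as in "1.200", in case your BreakIterator breaks there). The class name SuffixSplitter and the helpers endsWithAbbreviation and splitsNumber are hypothetical names for illustration, and the abbreviation list is the same incomplete one as above:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SuffixSplitter {

    // Same illustrative, incomplete list as in the answer; extend it for your own text.
    private static final List<String> ABBREVIATIONS =
            Arrays.asList("Dr.", "Prof.", "Mr.", "Mrs.", "Ms.", "Jr.", "Ph.D.");

    // True when the chunk, ignoring surrounding whitespace, ends with a known
    // abbreviation that starts at a word boundary.
    public static boolean endsWithAbbreviation(String chunk) {
        String trimmed = chunk.trim();
        for (String abbreviation : ABBREVIATIONS) {
            if (trimmed.endsWith(abbreviation)) {
                int before = trimmed.length() - abbreviation.length() - 1;
                // Reject matches glued to a preceding letter or digit, e.g. "XDr.".
                if (before < 0 || !Character.isLetterOrDigit(trimmed.charAt(before))) {
                    return true;
                }
            }
        }
        return false;
    }

    // True when the boundary at 'end' sits between "<digit>." and another digit.
    public static boolean splitsNumber(String document, int end) {
        return end >= 2 && end < document.length()
                && document.charAt(end - 1) == '.'
                && Character.isDigit(document.charAt(end - 2))
                && Character.isDigit(document.charAt(end));
    }

    public static List<String> breakSentence(String document) {
        List<String> sentences = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(document);
        int sentenceStart = bi.first(); // plays the role of tempStart
        int chunkStart = sentenceStart;
        for (int end = bi.next(); end != BreakIterator.DONE; chunkStart = end, end = bi.next()) {
            String chunk = document.substring(chunkStart, end);
            if (endsWithAbbreviation(chunk) || splitsNumber(document, end)) {
                continue; // looks like a false break: keep accumulating
            }
            sentences.add(document.substring(sentenceStart, end));
            sentenceStart = end;
        }
        // Flush whatever is left if the text ended on a held-back chunk.
        if (sentenceStart < document.length()) {
            sentences.add(document.substring(sentenceStart));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Prof. Roberts and Dr. Andrews trying to solve a "
                + "problem by writing a 1.200 lines of code. This will "
                + "work if Mr. Java writes solid code.";
        for (String s : breakSentence(text)) {
            System.out.println(s);
        }
    }
}
```

The trade-off is that a sentence which genuinely ends with "Jr." or "Ph.D." will now be joined to the one after it, so this is still a heuristic rather than a full solution.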

Like I said, this may not be a glamorous way to get what you want, but nobody else volunteered, and it works. *shrug*
