Divide a sentence into words and punctuation


Problem description

I need to parse a Sentence into words and punctuation marks (whitespace counts as punctuation), then add all of them to a general ArrayList<Sentence>.

Example sentence:

A man, a plan, a canal — Panama!
A => word
whitespace => punctuation
man => word
, + space => punctuation
a => word
[...]

I tried to read the whole sentence one character at a time, collecting consecutive characters of the same kind and creating a new Word or Punctuation from each collected run.

Here is my code:

public class Sentence {

    private String sentence;
    private LinkedList<SentenceElement> elements;

    /**
     * Constructs a sentence.
     * @param aText a string containing all characters of the sentence
     */
    public Sentence(String aText) {
        sentence = aText.trim();
        splitSentence();
    }

    public String getSentence() {
        return sentence;
    }

    public LinkedList<SentenceElement> getElements() {
        return elements;
    }

    /**
     * Splits the sentence into words and punctuation.
     */
    private void splitSentence() {
        if (sentence == "" || sentence == null || sentence == "\n") {
            return;
        }

        StringBuilder builder = new StringBuilder();

        int j = 0;
        boolean mark = false;
        while (j < sentence.length()) {
            //char current = sentence.charAt(j);

            while (Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Punctuation(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            } 
            mark = true;

            while (!Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Word(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            }
            mark = true;
        }
    }
}

But the logic of splitSentence() doesn't work correctly, and I can't find the right solution for it.

What I want to implement: read the first character and add it to the builder; while the next character is of the same type (letter or punctuation), keep adding to the builder; when the next character differs in type from the builder's content, create a new Word or Punctuation and reset the builder. Then the same logic repeats.

How can I implement this checking logic correctly?

Recommended answer

Split the string on word boundaries (except the first):

String[] parts = sentence.split("(?<!^)\\b");

The array will contain alternating word/punctuation/word/punctuation/word etc.
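
To end up with the list of elements the question asks for, each part can be wrapped in an object as it is classified. The following is a minimal sketch under stated assumptions: BoundarySplitter and Element are hypothetical names standing in for the question's Sentence, Word and Punctuation classes, and a part counts as a word when it matches \w+ (ASCII letters, digits and underscore by default in Java).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the original question or answer.
final class BoundarySplitter {

    /** Hypothetical stand-in for the question's Word/Punctuation classes. */
    static final class Element {
        final String text;
        final boolean isWord;

        Element(String text, boolean isWord) {
            this.text = text;
            this.isWord = isWord;
        }
    }

    /** Splits on word boundaries and tags each part as word or punctuation. */
    static List<Element> split(String sentence) {
        List<Element> elements = new ArrayList<>();
        // Guard: splitting an empty string would yield one empty part.
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }
        for (String part : sentence.split("(?<!^)\\b")) {
            // Same classification rule as the answer's test code.
            elements.add(new Element(part, part.matches("\\w+")));
        }
        return elements;
    }
}
```

The empty-string guard is there because "".split(...) returns an array holding a single empty string, which would otherwise be stored as a punctuation element.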

Here is some test code for the split:

String sentence = "A man, a plan, a canal — Panama!";
String[] parts = sentence.split("(?<!^)\\b");
for (String part : parts)
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")");

Output:

"A" (word)
" " (punctuation)
"man" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"plan" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"canal" (word)
" — " (punctuation)
"Panama" (word)
"!" (punctuation)

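The character-by-character approach from the question can also be made to work. As posted, it fails for several reasons: elements is never initialized, so the first add throws a NullPointerException; the inner while loops call charAt(j) without checking j against the length, so they run past the end of the string; the string checks use == instead of equals; and the last run is never added. The sketch below is one possible way to track runs explicitly. CharScanSplitter and Element are hypothetical names, and "word" is assumed to mean a run of characters for which Character.isLetter returns true, as in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the scan; not the original poster's class.
final class CharScanSplitter {

    /** Hypothetical stand-in for the question's Word/Punctuation classes. */
    static final class Element {
        final String text;
        final boolean isWord;

        Element(String text, boolean isWord) {
            this.text = text;
            this.isWord = isWord;
        }
    }

    /** Splits text into maximal runs of letters (words) and non-letters (punctuation). */
    static List<Element> split(String text) {
        List<Element> elements = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return elements;
        }
        int start = 0; // index where the current run begins
        boolean inWord = Character.isLetter(text.charAt(0));
        for (int i = 1; i <= text.length(); i++) {
            // A run ends at the end of the text or where the character type changes.
            boolean atEnd = i == text.length();
            if (atEnd || Character.isLetter(text.charAt(i)) != inWord) {
                elements.add(new Element(text.substring(start, i), inWord));
                start = i;
                inWord = !inWord;
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        for (Element e : split("A man, a plan, a canal - Panama!")) {
            System.out.println("\"" + e.text + "\" (" + (e.isWord ? "word" : "punctuation") + ")");
        }
    }
}
```

Unlike the \w-based split, this treats digits and underscores as punctuation, because Character.isLetter is false for them. Swapping in Character.isLetterOrDigit would change that if digits should belong to words.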