Divide a sentence into words and punctuation


Question

I need to parse a Sentence into words and punctuation (whitespace is considered punctuation), then add all of the pieces to a generic ArrayList<Sentence>.

Example sentence:

A man, a plan, a canal — Panama!
A => word
whitespace => punctuation
man => word
, + space => punctuation
a => word
[...]

I tried to read the whole sentence one character at a time, collecting consecutive characters of the same kind and creating a new Word or a new Punctuation from each collected run.

Here is my code:

public class Sentence {

    private String sentence;
    private LinkedList<SentenceElement> elements;

    /**
     * Constructs a sentence.
     * @param aText a string containing all characters of the sentence
     */
    public Sentence(String aText) {
        sentence = aText.trim();
        splitSentence();
    }

    public String getSentence() {
        return sentence;
    }

    public LinkedList<SentenceElement> getElements() {
        return elements;
    }

    /**
     * Splits the sentence into words and punctuation.
     */
    private void splitSentence() {
        if (sentence == "" || sentence == null || sentence == "\n") {
            return;
        }

        StringBuilder builder = new StringBuilder();

        int j = 0;
        boolean mark = false;
        while (j < sentence.length()) {
            //char current = sentence.charAt(j);

            while (Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Punctuation(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            } 
            mark = true;

            while (!Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Word(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            }
            mark = true;
        }
    }
}

But the logic of splitSentence() does not work correctly, and I can't find the right solution for it.

What I want to implement: read the first character => add it to the builder => while the next character is of the same type (letter or punctuation), keep adding to the builder => when the next character differs in type from the builder's content => create a new Word or Punctuation and reset the builder.

Then repeat the same logic for the rest of the sentence.

How can I implement this checking logic correctly?

Recommended answer

Split the string on word boundaries (except the one at the start of the string):

String[] parts = sentence.split("(?<!^)\\b");

The array will contain alternating word/punctuation/word/punctuation/word elements, and so on.
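To get from that array to the element list the question asks for, each part can be classified with the same `\w+` test. The following is only a minimal sketch: `SentenceElement`, `Word` and `Punctuation` here are hypothetical stand-ins, because the question does not show how those classes are defined, and `SplitToElements` and `toElements` are illustrative names. Note that `\w` matches only ASCII letters, digits and underscore by default.

```java
import java.util.ArrayList;
import java.util.List;

class SplitToElements {

    // Hypothetical stand-ins for the question's element classes (assumed shape).
    static abstract class SentenceElement {
        private final String text;
        SentenceElement(String text) { this.text = text; }
        String getText() { return text; }
    }

    static class Word extends SentenceElement {
        Word(String text) { super(text); }
    }

    static class Punctuation extends SentenceElement {
        Punctuation(String text) { super(text); }
    }

    // Splits on word boundaries and wraps each part in a Word or a Punctuation.
    static List<SentenceElement> toElements(String sentence) {
        List<SentenceElement> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements; // nothing to split
        }
        for (String part : sentence.split("(?<!^)\\b")) {
            if (part.matches("\\w+")) {
                elements.add(new Word(part));
            } else {
                elements.add(new Punctuation(part));
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        for (SentenceElement e : toElements("A man, a plan, a canal — Panama!")) {
            System.out.println(e.getClass().getSimpleName() + ": \"" + e.getText() + "\"");
        }
    }
}
```

Returning an empty list for null or empty input avoids the single empty string that `split` would otherwise produce for an empty sentence.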

Here is some test code:

String sentence = "A man, a plan, a canal — Panama!";
String[] parts = sentence.split("(?<!^)\\b");
for (String part : parts)
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")");

Output:

"A" (word)
" " (punctuation)
"man" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"plan" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"canal" (word)
" — " (punctuation)
"Panama" (word)
"!" (punctuation)
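For comparison, the character-by-character approach described in the question can also be made to work. The loop in the question fails for at least three reasons: `elements` is never initialized, the inner `while` loops call `sentence.charAt(j)` without checking `j < sentence.length()`, and the last collected run is never added to the list. Below is a minimal sketch of one possible fix. It returns plain strings for brevity, and `RunSplitter` and `splitRuns` are hypothetical names, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class RunSplitter {

    // Splits text into maximal runs of letters and maximal runs of non-letters.
    static List<String> splitRuns(String text) {
        List<String> runs = new ArrayList<>();
        if (text == null || text.isEmpty()) {
            return runs;
        }

        StringBuilder builder = new StringBuilder();
        // Type of the run currently being collected: true = letters.
        boolean inWord = Character.isLetter(text.charAt(0));

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isLetter = Character.isLetter(c);
            if (isLetter != inWord) {
                // The character type changed: flush the finished run, start a new one.
                runs.add(builder.toString());
                builder.setLength(0);
                inWord = isLetter;
            }
            builder.append(c);
        }
        runs.add(builder.toString()); // flush the final run
        return runs;
    }

    public static void main(String[] args) {
        for (String run : splitRuns("A man, a plan, a canal — Panama!")) {
            System.out.println("\"" + run + "\"");
        }
    }
}
```

One difference from the regex version: `Character.isLetter` treats digits and underscores as non-letters, whereas `\w` counts them as word characters, so the two approaches can disagree on input such as "route 66".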
