Split a string in Java into equal-length substrings while maintaining word boundaries


Question

How can I split a string into parts of at most a given character length while maintaining word boundaries?

For example, if I split the string "hello world" into substrings of at most 7 characters, it should return

"hello "

"world"

But my current implementation returns

"hello w"

"orld   "

I am using the following code, taken from "Split string to equal length substrings in Java", to split the input string into equal parts:

public static List<String> splitEqually(String text, int size) {
    // Give the list the right capacity to start with. You could use an array
    // instead if you wanted.
    List<String> ret = new ArrayList<String>((text.length() + size - 1) / size);

    for (int start = 0; start < text.length(); start += size) {
        ret.add(text.substring(start, Math.min(text.length(), start + size)));
    }
    return ret;
}

Is it possible to maintain word boundaries while splitting the string into substrings?

To be more specific, I need the splitting algorithm to respect the word boundaries marked by spaces rather than rely on character count alone. The character count still matters, but as an upper limit on the length of each part, not as a fixed length.

Recommended answer

If I understand your problem correctly, this code should do what you need (it assumes that maxLenght is equal to or greater than the length of the longest word):

String data = "Hello there, my name is not importnant right now."
        + " I am just simple sentecne used to test few things.";
int maxLenght = 10;
Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
Matcher m = p.matcher(data);
while (m.find())
    System.out.println(m.group(1));

Output:

Hello
there, my
name is
not
importnant
right now.
I am just
simple
sentecne
used to
test few
things.
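
The loop above can also be wrapped in a method that returns the chunks instead of printing them. The following is only a minimal sketch of that idea: the class name WordBoundarySplitter and the method name split are hypothetical and do not appear in the original answer, and the method keeps the answer's assumption that no word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not from the original answer.
class WordBoundarySplitter {

    // Splits text into chunks of at most maxLength characters, breaking only at whitespace.
    // Assumes maxLength >= 1 and that no single word is longer than maxLength.
    static List<String> split(String text, int maxLength) {
        Pattern p = Pattern.compile("\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 excludes the whitespace skipped by \s*
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Prints each chunk in quotes so that surrounding whitespace would be visible.
        for (String chunk : split("hello world", 7)) {
            System.out.println("\"" + chunk + "\"");
        }
    }
}
```

For "hello world" with a limit of 7 this yields "hello" and "world". The separating space is consumed by \s* outside the capturing group, so the chunks do not carry it (unlike the "hello " shown in the question). If a word is longer than the limit, the matcher finds no further match at that word and the rest of the text is silently dropped, so a caller may want to check for that case.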

Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

(Remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we need to write it as "\\d", because the \ must also be escaped in the string literal.)
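
The double escaping can be checked directly. The snippet below is a small illustrative sketch (the class name EscapeDemo and its helper method are made up for this example): the Java literal "\\d" is a two-character string, and those two characters, \d, are what the regex engine receives.

```java
import java.util.regex.Pattern;

// Illustrative only; EscapeDemo is a hypothetical name.
class EscapeDemo {

    // Returns true if the whole input matches the given regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String digit = "\\d";                        // Java literal; the regex engine receives \d
        System.out.println(digit.length());          // 2: a backslash followed by 'd'
        System.out.println(matches(digit, "7"));     // true: \d matches one digit
        System.out.println(matches(digit, "x"));     // false: 'x' is not a digit
        System.out.println(matches("\\s*", "  \t")); // true: \s* matches a run of whitespace
    }
}
```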


    • \G - is an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (same as ^ does)
    • \s* - represents zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
    • (.{1,"+maxLenght+"}) - let's split it into more parts (at runtime maxLenght will hold some numeric value like 10, so the regex will see it as .{1,10})
      • . represents any character (by default it represents any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can now represent any character; you may drop this argument if you want each line to be split separately, since its start will be printed on a new line anyway)
      • {1,10} - is a quantifier which lets the previously described element appear 1 to 10 times (by default it tries to match the maximal number of repetitions)
      • .{1,10} - so, based on what we just said, it simply represents "1 to 10 of any characters"
      • ( ) - parentheses create groups, structures which let us hold specific parts of the match (here we added the parentheses after \\s* because we want to use only the part after the whitespace)

    • (?=\\s|$) - is a look-ahead which makes sure that the text matched by .{1,10} is followed by:
      • whitespace (\\s), OR (written as |)
      • the end of the string ($).

So thanks to .{1,10} we can match up to 10 characters. But with (?=\\s|$) after it, we require that the last character matched by .{1,10} is not part of an unfinished word (there must be whitespace or the end of the string after it).
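
To see what the look-ahead contributes, the same pattern can be run with and without it on the string from the question. This is a minimal sketch for illustration; the class name LookaheadDemo and the helper collect are hypothetical, and the limit 7 is taken from the question's example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; LookaheadDemo and collect are hypothetical names.
class LookaheadDemo {

    // Collects group 1 of every successive match of regex in text.
    static List<String> collect(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "hello world";
        // Without the look-ahead the greedy .{1,7} cuts in the middle of a word.
        System.out.println(collect("\\G\\s*(.{1,7})", text));           // [hello w, orld]
        // With the look-ahead the match must end right before whitespace or the end of the string.
        System.out.println(collect("\\G\\s*(.{1,7})(?=\\s|$)", text));  // [hello, world]
    }
}
```

Without the look-ahead the result has the same mid-word cut as the splitEqually output in the question; with it, the quantifier backtracks until the chunk ends on a word boundary.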
