Java String.split() sometimes giving blank strings

Published: 2021/12/28 11:57:07 Tags: java regex string split

Problem description

I'm making a text-based dice roller. It takes in strings like "2d10+5" and returns a string with the result of the roll(s). My problem is showing up in the tokenizer that splits the string into useful parts for me to parse into information.

String[] tokens = message.split("(?=[dk\\+\\-])");

This is yielding strange, unexpected results. I don't know exactly what is causing them. It could be the regex, my misunderstanding, or Java just being Java. Here's what's happening:

3d6+4 yields the string array [3, d6, +4]. This is correct.
d% yields the string array [d%]. This is correct.
d20 yields the string array [d20]. This is correct.
d%+3 yields the string array [, d%, +3]. This is incorrect.
d20+2 yields the string array [, d20, +2]. This is incorrect.

In the fourth and fifth examples, something strange is causing an extra empty string to appear at the front of the array. It's not the lack of a number at the front of the string, as other examples disprove that. It's not the presence of the percentage sign, nor the plus sign.

For now I'm just continuing through the for loop on blank strings, but that feels sorta like a band-aid solution. Does anyone have any idea what causes the blank string at the front of the array? How can I fix it?
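The stop-gap described above, skipping blank tokens while looping, could look roughly like the sketch below. The names `TokenFilterDemo` and `tokenize` are made up for illustration and are not taken from the original code.

```java
import java.util.ArrayList;
import java.util.List;

class TokenFilterDemo {
    // Hypothetical helper: split with the original pattern, then skip blank tokens.
    static List<String> tokenize(String message) {
        List<String> tokens = new ArrayList<>();
        for (String token : message.split("(?=[dk\\+\\-])")) {
            if (token.isEmpty()) {
                continue; // the band-aid: ignore a stray empty string
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d20+2")); // [d20, +2]
        System.out.println(tokenize("3d6+4")); // [3, d6, +4]
    }
}
```

This hides the symptom without explaining it, which is why it feels like a band-aid.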

Solution

Digging through the source code, I found the exact cause of this behaviour.

The String.split() method internally uses Pattern.split(). Before returning the resulting array, the split method checks the index of the last match to see whether there was a match at all. If that index is 0, your pattern either matched just an empty string at the beginning of the input or didn't match at all; in that case the returned array is a single-element array containing the original input string. (This is the behaviour of the Java 7 and earlier implementation quoted below; from Java 8 onward, a zero-width match at the beginning of the input never produces an empty leading substring.)

Here's the source code:

public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<String>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);

                // Consider this assignment. For a single empty string match
                // m.end() will be 0, and hence index will also be 0
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // The rest of the method is not needed for this explanation

If the last condition in the above code, index == 0, is true, the single-element array containing the input string is returned.
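To experiment with this logic independently of the JDK version in use, the quoted loop can be approximated in a small standalone helper. This is a sketch, not the JDK source: `legacySplit` is a hypothetical name, it covers only the `limit == 0` case, and the tail (appending the remaining segment and dropping trailing empty strings) is my own reading of what the omitted part of the method does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplitDemo {
    // Hypothetical helper mimicking the quoted loop for limit == 0.
    static String[] legacySplit(String regex, String input) {
        int index = 0;
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);

        // Add the segment before each match found.
        while (m.find()) {
            parts.add(input.substring(index, m.start()));
            index = m.end(); // stays 0 after a single zero-width match at offset 0
        }

        // The check discussed above: no match, or only an empty match at offset 0.
        if (index == 0) {
            return new String[] { input };
        }

        // Assumed tail: add the remaining segment, then drop trailing empty strings.
        parts.add(input.substring(index));
        int size = parts.size();
        while (size > 0 && parts.get(size - 1).isEmpty()) {
            size--;
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])";
        System.out.println(Arrays.toString(legacySplit(regex, "d%")));    // [d%]
        System.out.println(Arrays.toString(legacySplit(regex, "d20+2"))); // [, d20, +2]
        System.out.println(Arrays.toString(legacySplit(regex, "3d6+4"))); // [3, d6, +4]
    }
}
```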

Now, consider the cases when the index can be 0.

When there is no match at all (as already noted in the comment above that condition).
When the only match is found at the beginning and the matched string has length 0. In that case the value of index assigned in the if block (inside the while loop),
```
index = m.end();
```
will be 0. The only match that can produce this is an empty string (length = 0) at the very start of the input, which is exactly the case here. There also must not be any further matches; otherwise index would be updated to a later position.

So, considering your cases:

For d%, there is just a single match for the pattern, before the first d, so index stays 0. Since there aren't any further matches, index is never updated, the if condition becomes true, and the single-element array with the original string is returned.
For d20+2 there are two matches, one before d and one before +. index is therefore updated to a non-zero value, and the ArrayList built in the above code is returned. It contains the empty string that results from splitting on a delimiter at the very start of the string, as already explained in @Stema's answer.
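The difference between the two cases can be checked by listing where the pattern actually matches. The sketch below uses a hypothetical helper, `matchOffsets`, that records the start and end offset of every match reported by `Matcher.find()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchOffsetsDemo {
    // Hypothetical helper: returns "start-end" for each match of regex in input.
    static List<String> matchOffsets(String regex, String input) {
        List<String> offsets = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            offsets.add(m.start() + "-" + m.end());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])";
        // One zero-width match at offset 0, so index never moves past 0.
        System.out.println(matchOffsets(regex, "d%"));    // [0-0]
        // A second zero-width match before '+', so index ends up at 3.
        System.out.println(matchOffsets(regex, "d20+2")); // [0-0, 3-3]
    }
}
```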

So, to get the behaviour you want (that is, split on a delimiter only when it is not at the beginning), you can add a negative look-behind to your regex pattern:

"(?<!^)(?=[dk+-])"  // You don't need to escape + or the hyphen (when it is at the end)

This splits on an empty string that is followed by one of the characters in your character class, but is not preceded by the beginning of the string.
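A quick way to try the look-behind pattern against the inputs from the question is sketched below. The class name `LookBehindSplitDemo` and the `tokenize` method are illustrative names only.

```java
import java.util.Arrays;

class LookBehindSplitDemo {
    // The pattern from the answer: a zero-width split point before d, k, + or -,
    // except at the very beginning of the input.
    static final String PATTERN = "(?<!^)(?=[dk+-])";

    // Hypothetical helper wrapping String.split with that pattern.
    static String[] tokenize(String message) {
        return message.split(PATTERN);
    }

    public static void main(String[] args) {
        String[] inputs = { "3d6+4", "d%", "d20", "d%+3", "d20+2" };
        for (String input : inputs) {
            System.out.println(input + " -> " + Arrays.toString(tokenize(input)));
        }
        // 3d6+4 -> [3, d6, +4]
        // d% -> [d%]
        // d20 -> [d20]
        // d%+3 -> [d%, +3]
        // d20+2 -> [d20, +2]
    }
}
```

Because the pattern no longer matches at offset 0, no token is produced for the empty prefix.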

Now consider splitting the string "ad%" on the regex pattern "a(?=[dk+-])". This gives you an array whose first element is an empty string. The only change here is that the zero-length match at the start is replaced by a match on a:

"ad%".split("a(?=[dk+-])");  // Result: [, d%]

Why? Because the length of the matched string is 1. The index value after the first match, m.end(), is therefore not 0 but 1, and hence the single-element array is not returned.
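The two situations can be compared directly by looking at `m.end()` for the first match. In the sketch below, `firstMatchEnd` is a hypothetical helper that returns -1 when there is no match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PositiveWidthMatchDemo {
    // Hypothetical helper: end offset of the first match, or -1 if there is none.
    static int firstMatchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Zero-width match at the start: end offset is 0.
        System.out.println(firstMatchEnd("(?=[dk+-])", "d%"));   // 0
        // The match consumes "a": end offset is 1.
        System.out.println(firstMatchEnd("a(?=[dk+-])", "ad%")); // 1
        // A positive-width match at the start yields a leading empty string.
        System.out.println(Arrays.toString("ad%".split("a(?=[dk+-])"))); // [, d%]
    }
}
```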
