Why in Java 8 split sometimes removes empty strings at start of result array?


Question

Before Java 8, when we split on an empty string like

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

|a|b|c|

because the empty string "" exists before and after each character. As a result, it would first generate this array

["", "a", "b", "c", ""]

and would then remove trailing empty strings (because we didn't explicitly pass a negative value for the limit argument), so it would finally return

["", "a", "b", "c"]


In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"].

My first guess was that leading empty strings are now also removed, just like trailing empty strings.

But this theory fails, since

"abc".split("a")

returns ["", "bc"], so the leading empty string was not removed.

Can someone explain what is going on here? How have the rules of split changed in Java 8?

Recommended answer

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause was also added to the documentation of String.split in Java 8, compared to Java 7.
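The clause can be checked directly on a Java 8 or later runtime. The sketch below is only an illustration (the class name SplitClauseDemo is made up for this example); the first two calls are the ones from the question, and the third passes a negative limit to show that trailing empty strings are a separate matter.

```java
import java.util.Arrays;

class SplitClauseDemo {
    public static void main(String[] args) {
        // Zero-width match at index 0: no empty leading substring on Java 8+.
        System.out.println(Arrays.toString("abc".split("")));      // [a, b, c]

        // Positive-width match at index 0: the empty leading substring is kept.
        System.out.println(Arrays.toString("abc".split("a")));     // [, bc]

        // A negative limit keeps the trailing empty string; the leading rule is unchanged.
        System.out.println(Arrays.toString("abc".split("", -1)));  // [a, b, c, ]
    }
}
```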

Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The following code, added in Java 8, excludes a zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
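To see which matches this condition affects, it can help to list the start and end of every match that Matcher.find() reports. The helper below is a small sketch for that purpose; the class MatchTrace and the method spans are names invented for this example, not part of the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchTrace {
    /** Returns "start-end" for every match of regex in input, in order. */
    static List<String> spans(String regex, CharSequence input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Only the first span (0-0) is zero-width at index 0, so only it is skipped.
        System.out.println(spans("", "abc"));   // [0-0, 1-1, 2-2, 3-3]

        // The span 0-1 has start != end, so the empty leading substring is still added.
        System.out.println(spans("a", "abc"));  // [0-1]
    }
}
```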

Maintaining compatibility

Following the behavior in Java 8 and above

To make split behave consistently across versions and compatible with the behavior in Java 8:

  1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and, if necessary, wrap the original regex in a non-capturing group (?:...).
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string, do both of the actions in step 1.

(?!\A) checks that the match does not end at the beginning of the string. A match that ends there is necessarily an empty match at the beginning of the string, so this is exactly the case that gets excluded.
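The wrapping described in step 1 could be sketched as follows. The class SplitCompat and the method noEmptyMatchAtStart are hypothetical names chosen for this illustration. The outputs in the comments are what a Java 8 or later runtime prints; the wrapped regex is intended to give the same result on Java 7.

```java
import java.util.Arrays;

class SplitCompat {
    /** Wraps regex so that it can no longer produce an empty match at index 0. */
    static String noEmptyMatchAtStart(String regex) {
        // (?:...) keeps alternations in the original regex grouped;
        // (?!\A) rejects any match that ends at the beginning of the input.
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        String zeroWidth = noEmptyMatchAtStart("");   // (?:)(?!\A)
        String positive = noEmptyMatchAtStart("a");   // (?:a)(?!\A)

        System.out.println(Arrays.toString("abc".split(zeroWidth)));  // [a, b, c]
        System.out.println(Arrays.toString("abc".split(positive)));   // [, bc]
    }
}
```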

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.
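One possible shape for such a custom implementation is a static helper that ports the Java 7 version of Pattern.split shown earlier. This is a sketch only: LegacySplit and its split method are hypothetical names, and call sites would have to be switched to it by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LegacySplit {
    /** Splits input around matches of pattern, following the Java 7 rules. */
    static String[] split(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                // Java 7 rule: no special case for a zero-width match at index 0.
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last allowed segment
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // index is still 0: the original listing treats this as "no match found".
        if (index == 0) {
            return new String[] {input.toString()};
        }

        // Add the remaining segment after the last match.
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, trailing empty strings are dropped.
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        Pattern empty = Pattern.compile("");
        System.out.println(Arrays.toString(split(empty, "abc", 0)));   // [, a, b, c]
        System.out.println(Arrays.toString(split(empty, "abc", -1)));  // [, a, b, c, ]
    }
}
```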
