Why does split in Java 8 sometimes remove empty strings at the start of the result array?
Question
Before Java 8, when we split on an empty string like
String[] tokens = "abc".split("");
the split mechanism would split at the places marked with |
|a|b|c|
because an empty string "" exists before and after each character. As a result, it would first generate this array
["", "a", "b", "c", ""]
and would then remove the trailing empty strings (because we didn't explicitly pass a negative value for the limit argument), so it would finally return
["", "a", "b", "c"]
In Java 8 the split mechanism seems to have changed. Now when we use
"abc".split("")
we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"].
My first guess was that leading empty strings are now also removed, just like trailing empty strings.
But this theory fails, since
"abc".split("a")
returns ["", "bc"], so the leading empty string was not removed.
Can someone explain what is going on here? How have the rules of split changed in Java 8?
Recommended answer
The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.
Comparing the documentation of Pattern.split in Java 7 and Java 8, we observe that the following clause was added:
When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.
The same clause was also added to the documentation of String.split in Java 8, compared to Java 7.
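As a quick illustration of the clause quoted above, the following minimal sketch contrasts a positive-width match at the start of the input with a zero-width one. The class name LeadingEmptyDemo and the helper show are made up for this example, and the expected output in the comments assumes the code runs on Java 8 or later.

```java
import java.util.Arrays;

public class LeadingEmptyDemo {
    // Illustrative helper (not part of the JDK): splits and renders the result.
    static String show(String input, String regex) {
        return Arrays.toString(input.split(regex));
    }

    public static void main(String[] args) {
        // Positive-width match at index 0 ("a" consumes one character):
        // the empty leading substring is kept.
        System.out.println(show("abc", "a")); // [, bc]

        // Zero-width match at index 0 (the empty regex matches at every position):
        // Java 8+ prints [a, b, c]; per the question, Java 7 returned ["", "a", "b", "c"].
        System.out.println(show("abc", ""));
    }
}
```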
Let us compare the code of Pattern.split in the reference implementation of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.
Java 7
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);
    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }
    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};
    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());
    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
Java 8
public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);
    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }
    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};
    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());
    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}
The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.
if (index == 0 && index == m.start() && m.start() == m.end()) {
    // no empty leading substring included for zero-width match
    // at the beginning of the input char sequence.
    continue;
}
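To see which matches that condition actually catches, it may help to list the match positions the loop iterates over. The sketch below is only an illustration (the class MatchPositions and its positions helper are made up here); it prints each match as start-end using the standard Matcher API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositions {
    // Illustrative helper: returns "start-end" for every match of regex in input.
    static List<String> positions(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // The empty regex matches before, between and after the characters.
        // Only the first match (0-0) has start == end == 0, so only it is skipped.
        System.out.println(positions("", "abc"));  // [0-0, 1-1, 2-2, 3-3]

        // "a" matches at 0-1: start != end, so the condition does not apply
        // and the empty leading substring is still added.
        System.out.println(positions("a", "abc")); // [0-1]
    }
}
```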
Maintaining compatibility
Following behavior in Java 8 and above
To make split behave consistently across versions and compatible with the behavior in Java 8:
1. If your regex can match a zero-length string, just add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
2. If your regex can't match a zero-length string, you don't need to do anything.
3. If you don't know whether the regex can match a zero-length string or not, do both of the actions in step 1.
(?!\A) checks that the match does not end at the beginning of the string. A match that ends there can only be an empty match at the beginning of the string, which is exactly the case being excluded.
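The rewrite can be sketched as a small helper. Here the lookahead is written as (?!\A), where \A is the start-of-input anchor. The class name SplitCompat and the method noLeadingEmptyMatch are hypothetical names chosen for this example. The helper always wraps the regex in a non-capturing group, which is the "if you don't know" case.

```java
import java.util.Arrays;

public class SplitCompat {
    // Hypothetical helper: wraps the regex in a non-capturing group and appends
    // (?!\A), so a match that ends at the very start of the input is rejected.
    static String noLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // Zero-width regex: the empty match at index 0 is no longer a match at all,
        // so no leading empty string can be produced.
        System.out.println(Arrays.toString("abc".split(noLeadingEmptyMatch(""))));  // [a, b, c]

        // Positive-width match at index 0 is unaffected by the rewrite.
        System.out.println(Arrays.toString("abc".split(noLeadingEmptyMatch("a")))); // [, bc]
    }
}
```

Because the rewritten regex never matches the empty string at index 0, the result does not depend on whether the Java 8 special case is present.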
There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
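For completeness, such a custom implementation could look roughly like the sketch below. It simply mirrors the Java 7 logic quoted earlier. The class LegacySplit and its split method are hypothetical names, not part of any library, and the sketch has not been checked against every edge case of the real Java 7 implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper mirroring the Java 7 code quoted above: a zero-width
    // match at index 0 still produces an empty leading substring.
    static String[] split(String input, String regex, int limit) {
        boolean matchLimited = limit > 0;
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0;
        while (m.find()) {
            if (!matchLimited || parts.size() < limit - 1) {
                parts.add(input.substring(index, m.start()));
                index = m.end();
            } else if (parts.size() == limit - 1) { // last allowed piece
                parts.add(input.substring(index));
                index = m.end();
            }
        }
        // Same check as the quoted code: index is still 0, so return the input whole.
        if (index == 0) {
            return new String[] { input };
        }
        if (!matchLimited || parts.size() < limit) {
            parts.add(input.substring(index));
        }
        // limit == 0 strips trailing empty strings, as in the JDK code.
        int size = parts.size();
        if (limit == 0) {
            while (size > 0 && parts.get(size - 1).isEmpty()) {
                size--;
            }
        }
        return parts.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join("|", split("abc", "", 0)));  // |a|b|c
        System.out.println(String.join("|", split("abc", "a", 0))); // |bc
    }
}
```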