Extracting pairs of words using String.split()


Question

Given:

String input = "one two three four five six seven";

Is there a regex that works with String.split() to grab (up to) two words at a time, such that:

String[] pairs = input.split("some regex");
System.out.println(Arrays.toString(pairs));

outputs this:

[one two, three four, five six, seven]

This question is about the split regex. It is not about "finding a work-around" or other "making it work in another way" solutions.

Recommended answer

Currently (last tested on Java 14) this can be done with split(), but don't use this approach in real-world code. It appears to rely on a bug: look-behind in Java is supposed to have an obvious maximum length, yet this solution uses \w+, which does not respect that limitation and somehow still works. If it is a bug and it gets fixed in a later release, this solution will stop working.

Instead, use the Pattern and Matcher classes with a regex like \w+\s+\w+. Aside from being safer, this also avoids maintenance hell for whoever inherits the code (remember: "Always code as if the person who ends up maintaining your code is a violent psychopath who knows where you live").
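A minimal sketch of the Pattern/Matcher route, not taken from the original answer. It assumes words are runs of \w characters and makes the second word optional, \w+(?:\s+\w+)?, so that a trailing unpaired word is still returned. The class and method names (WordPairs, pairs) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPairs {
    // One word, optionally followed by whitespace and a second word.
    // The optional group is an assumption added here so the last odd word is kept.
    private static final Pattern PAIR = Pattern.compile("\\w+(?:\\s+\\w+)?");

    // Collects each match of PAIR, scanning the input from left to right.
    static List<String> pairs(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(pairs("one two three four five six seven"));
        // prints [one two, three four, five six, seven]
    }
}
```

This uses only ordinary quantifiers outside any look-behind, so it does not depend on the behaviour the answer warns about.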

Is this what you are looking for?
(You can replace \w with \S to include all non-space characters, but for this example I will keep \w, since a regex with \w\s is easier to read than one with \S\s.)

String input = "one two three four five six seven";
String[] pairs = input.split("(?<!\\G\\w+)\\s");
System.out.println(Arrays.toString(pairs));

Output:

[one two, three four, five six, seven]


\G is the end of the previous match; (?<!regex) is a negative look-behind.

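In split we are trying to

  1. find a space -> \s
  2. that is not preceded -> (?<!...)
  3. by a word -> \w+
  4. that starts where the previous match (a space) ended -> \G
  5. i.e. not preceded by -> \G\w+.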

The only thing that confused me at first was how this works for the first space, since we want that space to be ignored. The key point is that, at the start, \G matches the beginning of the String, like ^.

So on the first iteration, the negative look-behind effectively looks like (?<!^\w+), and since the first space does have ^\w+ before it, it cannot be a match for the split. The next space does not have this problem, so it is matched, and information about it (such as its position in the input String) is stored in \G and used later by the next negative look-behind.

So for the 3rd space, the regex checks whether the previously matched space \G followed by a word \w+ comes right before it. Since this test succeeds, the negative look-behind rejects it, so this space is not matched. The 4th space does not have this problem, because the space before it is not the one stored in \G (it is at a different position in the input String).
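The walk-through above can be checked with a small trace, sketched here as an illustration rather than as part of the original answer. It records the index of every space the pattern matches. To avoid relying on the unbounded \w+ inside the look-behind, the sketch assumes words are at most 100 characters long and uses \w{1,100} instead. The class and method names (SplitTrace, matchedSpaces) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitTrace {
    // Same idea as (?<!\G\w+)\s, but with a bounded quantifier
    // (assumption: no word is longer than 100 characters).
    private static final Pattern SPLIT_POINT =
            Pattern.compile("(?<!\\G\\w{1,100})\\s");

    // Returns the index of each space that would act as a split point.
    static List<Integer> matchedSpaces(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = SPLIT_POINT.matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Spaces sit at indices 3, 7, 13, 18, 23 and 27.
        System.out.println(matchedSpaces("one two three four five six seven"));
        // prints [7, 18, 27]
    }
}
```

Only every second space is matched: the spaces at 3, 13 and 23 are each preceded by a single word that begins exactly where \G points, so the negative look-behind rejects them.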

Also, if someone would like to split on, say, every 3rd space, you can use this form (based on @maybeWeCouldStealAVan's answer, which had been deleted by the time I posted this part of the answer):

input.split("(?<=\\G\\w{1,100}\\s\\w{1,100}\\s\\w{1,100})\\s")

Instead of 100 you can use any value that is at least the length of the longest word in the String.
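As a rough sketch of how the bounded form generalizes, the helper below builds the look-behind for "every n-th space" by repeating the word part n times. It is an illustration, not code from the original answer: the names EveryNthSpace and regexFor are hypothetical, and maxWordLength is an assumed upper bound on word length that the caller has to supply.

```java
import java.util.Arrays;

class EveryNthSpace {
    // Builds (?<=\G WORD (\s WORD){n-1})\s where WORD is \w{1,maxWordLength}.
    static String regexFor(int n, int maxWordLength) {
        String word = "\\w{1," + maxWordLength + "}";
        StringBuilder sb = new StringBuilder("(?<=\\G").append(word);
        for (int i = 1; i < n; i++) {
            sb.append("\\s").append(word);
        }
        return sb.append(")\\s").toString();
    }

    public static void main(String[] args) {
        String input = "one two three four five six seven";
        System.out.println(Arrays.toString(input.split(regexFor(3, 100))));
        // prints [one two three, four five six, seven]
        System.out.println(Arrays.toString(input.split(regexFor(2, 100))));
        // prints [one two, three four, five six, seven]
    }
}
```

Because every quantifier inside the look-behind is bounded, the pattern has an obvious maximum length and does not lean on the behaviour the answer flags as a possible bug.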

I just noticed that we can also use + instead of {1,maxWordLength} if we want to split on every odd-numbered occurrence, for example every 3rd, 5th or 7th:

String data = "0,0,1,2,4,5,3,4,6,1,3,3,4,5,1,1";
String[] array = data.split("(?<=\\G\\d+,\\d+,\\d+,\\d+,\\d+),"); // every 5th comma
