Java - Regex to Split Tokens With Minimum Size and Delimiters
Question
I know there are many similar questions, and I can say I have read all of them. But I am not good at regex, and I couldn't figure out the regular expression that I need.
I want to split a String in Java, and I have 4 constraints:
- The delimiters are [.?!] (end of the sentence)
- Decimal numbers shouldn't be tokenized
- The delimiters shouldn't be removed.
- The minimum size of each token should be 5
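For example, for the input:
Hello World!This answer is worth $1.45 in U.S. dollars. Thank you.
the output would be:
[Hello World!, This answer is worth $1.45 in U.S. dollars., Thank you.]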
So far, I have an answer for the first three constraints with this regex:
text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
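To make the starting point concrete, here is a minimal sketch of how that split behaves. The class name and both input strings are illustrative assumptions, not taken from the question; the second input shows the abbreviation problem described further down.

```java
import java.util.Arrays;

class AskerRegexDemo {
    // The regex from the question: split right after '.', '!' or '?'
    // unless the next character is a digit (so 1.45 stays intact).
    static final String ASKER_REGEX = "(?<=[.!?])(?<!\\d)(?!\\d)";

    public static void main(String[] args) {
        // Illustrative input (an assumption, not the asker's exact text).
        String a = "Hello World!This answer is worth $1.45. Thank you.";
        System.out.println(Arrays.toString(a.split(ASKER_REGEX)));
        // prints [Hello World!, This answer is worth $1.45.,  Thank you.]

        // The abbreviation is broken apart: "S." becomes its own token.
        String b = "I love U.S. How about you?";
        System.out.println(Arrays.toString(b.split(ASKER_REGEX)));
        // prints [I love U., S.,  How about you?]
    }
}
```

Because the match is zero-width, the delimiters are kept, but the whitespace after each delimiter stays at the start of the next token.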
I know I should use {5,} somewhere in my regex, but none of the combinations I tried works.
For cases like "I love U.S. How about you?" it doesn't matter whether I get one or two sentences, as long as it doesn't tokenize S. as a separate sentence.
Finally, a pointer to a good regex tutorial would be appreciated.
UPDATE: As Chris mentioned in the comments, it is almost impossible to solve problems like this (covering all the cases that occur in natural languages) with regex. However, I found HamZa's answer the closest, and the most useful one.
So, beware! The accepted answer will not cover all possible use cases!
Recommended answer
I am basing my answer on a previously made regex.
The regex was basically (?<=[.?!])\s+(?=[a-z]), which means: match one or more whitespace characters preceded by ., ? or ! and followed by [a-z] (not forgetting the i modifier).
Now let's modify it to the needs of this question:
- We'll first convert it to a Java regex (each backslash is doubled inside a Java string literal):
(?<=[.?!])\\s+(?=[a-z])
- We'll add the i modifier to match case-insensitively:
(?i)(?<=[.?!])\\s+(?=[a-z])
- We'll put the expression in a positive lookahead to prevent the "eating" of the characters (the split delimiters, in this case the whitespace):
(?=(?i)(?<=[.?!])\\s+(?=[a-z]))
- We'll add a negative lookbehind to check that there is no abbreviation of the form LETTER DOT LETTER DOT right before the split point:
(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
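The effect of the lookahead step can be sketched as follows. The class name and the sample sentence are assumptions made for illustration; the two patterns are the consuming form and the lookahead-wrapped form from the steps above.

```java
import java.util.Arrays;

class LookaheadWrapDemo {
    // Consuming form: the whitespace between sentences is matched and dropped.
    static final String CONSUMING = "(?i)(?<=[.?!])\\s+(?=[a-z])";
    // Lookahead-wrapped form: the match is zero-width, so nothing is dropped.
    static final String ZERO_WIDTH = "(?=(?i)(?<=[.?!])\\s+(?=[a-z]))";

    public static void main(String[] args) {
        // Illustrative input (an assumption).
        String text = "Hello World! This is fine. Thank you.";

        System.out.println(Arrays.toString(text.split(CONSUMING)));
        // prints [Hello World!, This is fine., Thank you.]

        System.out.println(Arrays.toString(text.split(ZERO_WIDTH)));
        // prints [Hello World!,  This is fine.,  Thank you.]
    }
}
```

With the wrapped form every character of the input survives, at the cost of a leading space on each following token; the consuming form gives trimmed tokens. In both forms the sentence-ending punctuation is kept, because it is only inspected by the lookbehind.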
So our final regex, written as a Java string, looks like: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
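A minimal sketch of the final regex in use is shown below. The regex does not enforce the minimum token size of 5 from the question, so the sketch also includes mergeShort, a hypothetical post-processing helper that is not part of the answer: it appends any too-short token to the previous one. The class name, the helper and the input string are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class FinalRegexDemo {
    // Final regex as a Java string literal (regex backslashes doubled).
    static final String FINAL_REGEX =
        "(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])";

    // Hypothetical helper (an assumption, not from the answer): a token
    // shorter than minSize is appended to the previous token, joined by
    // one space. A short first token is left as it is.
    static List<String> mergeShort(String[] tokens, int minSize) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!out.isEmpty() && t.length() < minSize) {
                int last = out.size() - 1;
                out.set(last, out.get(last) + " " + t);
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input (an assumption).
        String text = "I love U.S. How about you? Ok. Thank you.";
        String[] tokens = text.split(FINAL_REGEX);
        for (String t : tokens) {
            System.out.println("[" + t + "]");
        }
        // prints [I love U.S. How about you?], [Ok.] and [Thank you.]
        // on separate lines: there is no split after "U.S."

        System.out.println(mergeShort(tokens, 5));
        // prints [I love U.S. How about you? Ok., Thank you.]
    }
}
```

Merging after the split keeps the regex readable; trying to express the length constraint inside the pattern itself is where the {5,} attempts in the question ran into trouble.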
Some links:
- Online tester, jump to JAVA
- Explain tool (Not JAVA based)
- THE regex tutorial
- Java specific regex tutorial
- SO regex chatroom
- Some advanced nice regex-fu on SO:
  - How does this regex find triangular numbers?
  - How can we match a^n b^n with Java regex?
  - How does this Java regex detect palindromes?
  - How to determine if a number is a prime with regex?
  - "vertical" regex matching in an ASCII "image"
  - Can the for loop be eliminated from this piece of PHP code? (See the regex solution, although I am not sure whether it is applicable in Java.)