Java - Regex to Split Tokens With Minimum Size and Delimiters


Problem description

I know, there are many similar questions, and I have read all of them. But I am not good at regex, and I couldn't figure out the regular expression that I need.

I want to split a String in Java, and I have 4 constraints:


  1. The delimiters are [.?!] (end of the sentence).
  2. Decimal numbers shouldn't be tokenized.
  3. The delimiters shouldn't be removed.
  4. The minimum size of each token should be 5.

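For example, for the input:

Hello World! This answer is worth $1.45 in U.S. dollars. Thank you.

The output will be:

[Hello World!, This answer is worth $1.45 in U.S. dollars., Thank you.]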

So far, I have satisfied the first three constraints with this regex:

text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
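To see what this expression does in practice, here is a small sketch that runs it on a sample sentence in the style of the question. The class name SplitAttempt and the sample text are made up for illustration. It shows that the delimiters are kept and 1.45 stays intact, but U.S. is still broken up and S. comes out as a token of its own.

```java
class SplitAttempt {
    // The regex from the question: split right after ., ? or !
    // unless the next character is a digit (so 1.45 is left alone).
    static String[] split(String text) {
        return text.split("(?<=[.!?])(?<!\\d)(?!\\d)");
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String text = "Hello World! This answer is worth $1.45 in U.S. dollars. Thank you.";
        for (String token : split(text)) {
            // Brackets make leading spaces visible.
            System.out.println("[" + token + "]");
        }
    }
}
```

Because the split is zero-width, nothing is removed, so every token after the first keeps its leading space.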

I know I should use {5,} somewhere in my regex, but none of the combinations I tried works.

For cases like "I love U.S. How about you?", it doesn't matter whether it gives me one or two sentences, as long as it doesn't tokenize S. as a separate sentence.
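One way to approximate the minimum-size constraint without putting {5,} in the regex is to split first and then glue any token shorter than 5 characters onto the token before it. This is only a sketch of that idea and not something proposed in the thread; the class MinSizeMerger and the method splitWithMinSize are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class MinSizeMerger {
    // Hypothetical helper: split with the regex from the question, then
    // append every token shorter than minSize to the previous token.
    static List<String> splitWithMinSize(String text, int minSize) {
        List<String> result = new ArrayList<>();
        for (String token : text.split("(?<=[.!?])(?<!\\d)(?!\\d)")) {
            if (!result.isEmpty() && token.length() < minSize) {
                int last = result.size() - 1;
                result.set(last, result.get(last) + token);
            } else {
                // A short first token has nothing to merge into, so it is kept.
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitWithMinSize("I love U.S. How about you?", 5));
    }
}
```

With this input the short token S. is merged back, giving the two tokens "I love U.S." and " How about you?" (the second keeps its leading space).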

Finally, a pointer to a good regex tutorial would be appreciated.

UPDATE: As Chris mentioned in the comments, it is almost impossible to solve problems like this (covering all the cases that occur in natural languages) with regex. However, I found HamZa's answer the closest and the most useful one.

So, be careful! The accepted answer will not cover all possible use cases!

Recommended answer

I'm basing my answer on a previously made regex.
The regex was basically (?<=[.?!])\s+(?=[a-z]), which means: match one or more whitespace characters preceded by ., ? or ! and followed by [a-z] (not forgetting the i modifier).

Now let's modify it to fit the needs of this question:


  1. We'll first convert it to a Java regex (written as a string literal, so each backslash is doubled): (?<=[.?!])\\s+(?=[a-z])
  2. We'll add the i modifier to match case-insensitively: (?i)(?<=[.?!])\\s+(?=[a-z])
  3. We'll put the expression in a positive lookahead to prevent the "eating" of the characters (delimiters in this case): (?=(?i)(?<=[.?!])\\s+(?=[a-z]))
  4. We'll add a negative lookbehind to check that there is no abbreviation in the format LETTER DOT LETTER DOT: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
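The difference between the consuming form from step 2 and the lookahead-wrapped form from step 3 is easiest to see side by side. The following is a minimal sketch; the class name LookaheadVsConsume and the sample sentence are made up for illustration.

```java
import java.util.Arrays;

class LookaheadVsConsume {
    // Step 2: the whitespace is part of the match, so split() drops it.
    static final String CONSUMING = "(?i)(?<=[.?!])\\s+(?=[a-z])";
    // Step 3: the whole expression sits in a lookahead, so the match is
    // zero-width and the whitespace stays at the start of the next token.
    static final String ZERO_WIDTH = "(?=(?i)(?<=[.?!])\\s+(?=[a-z]))";

    public static void main(String[] args) {
        String text = "Hi there! How are you? Fine.";
        System.out.println(Arrays.toString(text.split(CONSUMING)));
        System.out.println(Arrays.toString(text.split(ZERO_WIDTH)));
    }
}
```

In both forms the punctuation survives, because it is only ever matched inside a lookbehind; the two differ only in whether the whitespace between sentences is dropped or kept.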

So our final regex looks like this: (?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])
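As a quick check, the final expression can be passed to String.split and run on a sentence like the one in the question. This is a sketch only; the class name SentenceSplitter and the sample text are assumptions for illustration.

```java
import java.util.Arrays;

class SentenceSplitter {
    // Final regex from the answer, as a Java string literal: split on
    // whitespace that follows ., ? or !, is not preceded by an abbreviation
    // of the form LETTER DOT LETTER DOT, and is followed by a letter.
    static final String SENTENCE_BREAK =
            "(?i)(?<=[.?!])(?<![a-z]\\.[a-z]\\.)\\s+(?=[a-z])";

    static String[] split(String text) {
        return text.split(SENTENCE_BREAK);
    }

    public static void main(String[] args) {
        // Sample input chosen for illustration only.
        String text = "Hello World! This answer is worth $1.45 in U.S. dollars. Thank you.";
        System.out.println(Arrays.toString(split(text)));
    }
}
```

On this input the split happens after "World!" and after "dollars.", while 1.45 and U.S. are left alone. The expression does not enforce the minimum token size, and it only splits where the next sentence starts with a letter, so inputs outside that shape may need extra handling.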

