Latin Regex with symbols

Question

I need to split a text and get only words, numbers, and hyphenated compound words. I also need to match Latin words, so I used \p{L}, which covers é, ú, ü, ã, and so forth. For example:

// The double quote and the backslash inside the literal must be escaped for this to compile.
String myText = "Some latin text with symbols, ? 987 (A la pointe sud-est de l'île se dresse la cathédrale Notre-Dame qui fut lors de son achèvement en 1330 l'une des plus grandes cathédrales d'occident) : ! @ # $ % ^& * ( ) + - _ #$% \"  ' : ; > < / \\  | ,  here some is wrong… * + () e -";

Pattern pattern = Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");
String[] words = pattern.split(myText);

What is wrong with this regex? Why does it match symbols like "(", "+", "-", "*" and "|"?

Some of the results are:

dresse     // OK
sud-est    // OK
occident)  // WRONG
987        // OK
()         // WRONG
(a         // WRONG
*          // WRONG
-          // WRONG
+          // WRONG
(          // WRONG
|          // WRONG

My explanation of the regex is:

[^\p{L}+(\-\p{L}+)*\d]+

 * Word separator will be:
 *     [^  ...  ]  No sequence in:
 *     \p{L}+        Any latin letter
 *     (\-\p{L}+)*   Optionally hyphenated
 *     \d            or numbers
 *     [ ... ]+      once or more.

Recommended answer

If my understanding of your requirement is correct, this regex will match what you want:

"\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+"

It will match:

  • A contiguous sequence of Unicode Latin script characters. I restrict it to Latin script, since \p{L} matches a letter in any script. Change \\p{IsLatin} to \\pL if your version of Java doesn't support the syntax.
  • Or several such sequences, hyphenated
  • Or a contiguous sequence of decimal digits (0-9)
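As for why the original pattern lets symbols through: everything between [^ and ] is a single negated character class, so +, (, ), * and the escaped - are read there as literal characters, not as quantifiers or a group. The class therefore means "any character that is not a letter, a digit, +, (, ), * or -", and those symbols are never treated as separators. A minimal sketch to illustrate this (the class name CharClassDemo and the sample string are made up for the example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CharClassDemo {
    // The pattern from the question. Inside [^ ... ] the characters
    // + ( ) * are plain literals, so they are excluded from the separators.
    static final Pattern ORIGINAL =
            Pattern.compile("[^\\p{L}+(\\-\\p{L}+)*\\d]+");

    // Splits the text exactly the way the question does.
    static List<String> splitWithOriginal(String text) {
        return Arrays.asList(ORIGINAL.split(text));
    }

    public static void main(String[] args) {
        // ':' and '!' are separators, but ')', '(', '+' and '*' survive.
        System.out.println(splitWithOriginal("occident) : ! ( a + b * c"));
        // prints [occident), (, a, +, b, *, c]
    }
}
```

This is why matching the tokens you want, as in the regex above, tends to be easier than describing every possible separator.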

To use the regex from this answer, compile it with Pattern.compile, call matcher(String input) to obtain a Matcher object, and loop over the matches with find().

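import java.util.regex.Matcher;
import java.util.regex.Pattern;

// inputString is the text to extract words and numbers from.
Pattern pattern = Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");
Matcher matcher = pattern.matcher(inputString);

// Each successful find() moves to the next word or number.
while (matcher.find()) {
    System.out.println(matcher.group());
}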
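If you would rather collect the tokens than print them, the same loop can fill a list. This is only a sketch: the class name TokenExtractor, the method name tokens and the sample input are assumptions made for illustration, and the non-ASCII letter is written as a Unicode escape so the source compiles under any file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenExtractor {
    // Latin-script words (optionally hyphenated) or runs of digits.
    static final Pattern TOKEN =
            Pattern.compile("\\p{IsLatin}+(?:-\\p{IsLatin}+)*|\\d+");

    // Returns every match in order of appearance.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // \u00EE is i with circumflex, as in the French word for "island".
        String sample = "sud-est de l'\u00EEle (occident) 987 + - *";
        for (String token : tokens(sample)) {
            System.out.println(token);
        }
    }
}
```

With this pattern the apostrophe is not part of a word, so l'île comes back as the two tokens l and île, while the parentheses and the stray + - * are skipped entirely.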

If you want to allow words with an apostrophe ('):

"\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+"

I also escape - in the character class ['\\-] in case you want to add more characters to it. Strictly speaking, - doesn't need escaping when it is the first or last character in the class, but I escape it anyway to be safe.
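A small sketch of the apostrophe variant, which also checks that the escaped and unescaped forms of the hyphen behave the same when the hyphen is last in the class. The class name ApostropheDemo, the helper tokens and the sample text are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApostropheDemo {
    // Hyphen escaped inside the class, as in the answer.
    static final Pattern ESCAPED =
            Pattern.compile("\\p{IsLatin}+(?:['\\-]\\p{IsLatin}+)*|\\d+");
    // Hyphen left unescaped; it is literal because it is the last character.
    static final Pattern UNESCAPED =
            Pattern.compile("\\p{IsLatin}+(?:['-]\\p{IsLatin}+)*|\\d+");

    // Collects every match of the given pattern in order.
    static List<String> tokens(Pattern pattern, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "l'une des cathedrales Notre-Dame d'occident (1330)";
        System.out.println(tokens(ESCAPED, sample));
        // prints [l'une, des, cathedrales, Notre-Dame, d'occident, 1330]
        System.out.println(tokens(ESCAPED, sample).equals(tokens(UNESCAPED, sample)));
        // prints true
    }
}
```

Note that only the ASCII apostrophe ' is listed in the class; a typographic apostrophe would need to be added separately if your text uses one.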
