Detecting words that start with an accented uppercase letter using regular expressions


Question

I want to extract the words that begin with a capital letter, including accented capitals, using regular expressions in Java.

This is my condition for words beginning with a capital A through Z:

if (link.text().matches("^[A-Z].+") == true) 

But I also want words that begin with an accented uppercase character.

Do you have any ideas?

Recommended answer

To match an uppercase letter at the beginning of the string, you need the pattern ^\p{Lu}.
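A minimal sketch of that pattern in Java follows. The class and method names are made up for illustration, and the accented sample words are written with Unicode escapes so the source compiles regardless of file encoding. Note that String.matches() has to consume the whole input, so the sketch uses Matcher.find() with the anchored pattern instead.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
final class UppercaseStart {
    // ^ anchors at the start of the input; \p{Lu} is one letter in general category Lu.
    private static final Pattern STARTS_UPPER = Pattern.compile("^\\p{Lu}");

    static boolean startsWithUppercase(String text) {
        // find() only needs the first character to match, unlike String.matches().
        return STARTS_UPPER.matcher(text).find();
    }

    public static void main(String[] args) {
        // Samples: E-acute "cole", A-ring "ngstrom", plain ASCII, lowercase e-acute, digits.
        String[] samples = {"\u00C9cole", "\u00C5ngstr\u00F6m", "Zebra", "\u00E9cole", "42nd"};
        for (String s : samples) {
            System.out.println(s + " -> " + startsWithUppercase(s));
        }
    }
}
```

To stay close to the condition in the question, link.text().matches("^\\p{Lu}.+") should behave the same way for texts of two or more characters on a single line.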

Unfortunately, Java does not support the mandatory \p{Uppercase} property, which is necessary for meeting UTS#18's RL1.2.

That's hardly the only thing missing from Java regular expressions to meet even Level 1, the most bare-bones Basic Unicode Functionality. Without Level 1, you really can't work with Unicode text using regular expressions. Too much is broken or absent.

UTS#18's RL1.1 will finally be met with JDK7, but I do not believe there are currently any plans to meet RL1.2, RL1.2a, or any of the other requirements it currently lacks, or even to meet the two Strong Recommendations. Alas!

Indeed, of the very short list of mandatory properties required by RL1.2, Java is missing the \p{Alphabetic}, \p{Uppercase}, \p{Lowercase}, \p{White_Space}, \p{Noncharacter_Code_Point}, \p{Default_Ignorable_Code_Point}, \p{ANY}, and \p{ASSIGNED} properties. Those are all mandatory, but they are either completely missing or fail to obey The Unicode Standard with respect to their definitions. This is also the problem with the POSIX-compatible properties in Java: they're all broken with respect to UTS#18.
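For readers on newer runtimes: as far as I can tell from the java.util.regex.Pattern documentation, Java 7 and later accept binary properties written with an Is prefix, such as \p{IsUppercase} and \p{IsAlphabetic}, which narrows part of the gap described above. The sketch below assumes such a runtime and uses a hypothetical class name. It contrasts the general category Lu with the broader Uppercase property, using ROMAN NUMERAL EIGHT (U+2167), which is a letter-number rather than an Lu letter but carries Other_Uppercase.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; assumes Java 7 or later.
final class BinaryPropertyDemo {
    // General category Lu only (uppercase letters).
    static boolean isLu(String s) {
        return Pattern.matches("\\p{Lu}", s);
    }

    // Binary Uppercase property, spelled with the Is prefix in Java.
    static boolean hasUppercaseProperty(String s) {
        return Pattern.matches("\\p{IsUppercase}", s);
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9";      // LATIN CAPITAL LETTER E WITH ACUTE
        String romanEight = "\u2167";  // ROMAN NUMERAL EIGHT
        System.out.println("E-acute: Lu=" + isLu(eAcute)
                + " Uppercase=" + hasUppercaseProperty(eAcute));
        System.out.println("Roman VIII: Lu=" + isLu(romanEight)
                + " Uppercase=" + hasUppercaseProperty(romanEight));
    }
}
```

For the original question either form matches accented Latin capitals; the difference only shows up for characters outside the Lu category.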

Prior to JDK7, it was also missing the mandatory Script properties. JDK7 does get script properties at long last, but that's all, nothing else. Java is still light years away from meeting even RL1.2a, which is a daily gotcha for zillions of programmers.

In JDK7, you can finally also use two-part properties in the form \p{name=value} if they're block, script, or general category properties. That means these are all the same in JDK7's Pattern class:


  • \p{Block=Number_Forms}, \p{blk=Number_Forms}, and \p{InNumber_Forms}.
  • \p{Script=Latin}, \p{sc=Latin}, \p{IsLatin}, and \p{Latin}.
  • \p{General_Category=Lu}, \p{GC=Lu}, and \p{Lu}.
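The equivalences in the list can be checked with a short sketch like the one below, assuming Java 7 or later. The class name and the allMatch helper are hypothetical. The sketch only exercises spellings that the Pattern documentation describes (the block/blk, script/sc, and general_category/gc keywords plus the In and Is prefixes), and it leaves the bare \p{Latin} form out because I have not verified that one.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; assumes Java 7 or later.
final class PropertySyntaxDemo {
    // Returns true only if every regex matches the whole input.
    static boolean allMatch(String input, String... regexes) {
        for (String regex : regexes) {
            if (!Pattern.matches(regex, input)) {
                return false;
            }
        }
        return true;
    }

    static boolean blockForms(String s) {
        return allMatch(s, "\\p{Block=Number_Forms}", "\\p{blk=Number_Forms}", "\\p{InNumber_Forms}");
    }

    static boolean scriptForms(String s) {
        return allMatch(s, "\\p{Script=Latin}", "\\p{sc=Latin}", "\\p{IsLatin}");
    }

    static boolean categoryForms(String s) {
        return allMatch(s, "\\p{General_Category=Lu}", "\\p{gc=Lu}", "\\p{Lu}");
    }

    public static void main(String[] args) {
        System.out.println("block forms agree:    " + blockForms("\u2167"));    // ROMAN NUMERAL EIGHT
        System.out.println("script forms agree:   " + scriptForms("\u00E9"));   // e with acute
        System.out.println("category forms agree: " + categoryForms("\u00C9")); // E with acute
    }
}
```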

However, you still cannot use the long forms like \p{Lowercase_Letter} and \p{Letter_Number}, and the POSIX-looking properties are all broken from RL1.2a's perspective. Plus, super-basic properties from RL1.2 like \p{White_Space} and \p{Alphabetic} are still missing.
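Since support differs between Java releases, one way to find out what a given runtime accepts is simply to try compiling each property and catch the exception. This is only a sketch with a hypothetical class name; it makes no claim about which of the probed names your JDK will accept, apart from \p{Lu}, which has long been supported.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical probe class, for checking a runtime rather than documenting it.
final class PropertyProbe {
    // True if the running JDK's Pattern class can compile the given regex.
    static boolean isSupported(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Unknown property names are rejected at compile time.
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "\\p{Lu}",
            "\\p{Lowercase_Letter}",
            "\\p{Letter_Number}",
            "\\p{White_Space}",
            "\\p{Alphabetic}",
        };
        for (String regex : candidates) {
            System.out.println(regex + " supported: " + isSupported(regex));
        }
    }
}
```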

There was some talk of trying to fix \b and \B, which are miserably broken with respect to \w and \W, but I don't know how they're going to fix all that without fully complying with RL1.2a. And no, I have no idea when they will add those basic properties to Java. You can't get by without them, either.
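To see the \w problem concretely, the sketch below checks a lowercase e with acute against \w twice: once with default flags, where \w is ASCII-only, and once with Pattern.UNICODE_CHARACTER_CLASS, a flag documented for Java 7 and later that switches the predefined classes to their Unicode definitions. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the Unicode variant assumes Java 7 or later.
final class WordCharDemo {
    // Default behavior: \w covers only [a-zA-Z_0-9].
    private static final Pattern DEFAULT_WORD = Pattern.compile("\\w");

    // With the flag, \w follows the Unicode definition of a word character.
    private static final Pattern UNICODE_WORD =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isDefaultWordChar(String s) {
        return DEFAULT_WORD.matcher(s).matches();
    }

    static boolean isUnicodeWordChar(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
        System.out.println("default \\w matches e-acute: " + isDefaultWordChar(eAcute));
        System.out.println("unicode \\w matches e-acute: " + isUnicodeWordChar(eAcute));
    }
}
```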

To fully work with Unicode using regexes in Java at even Level 1, you really cannot use the standard Pattern class that Java comes with. The easiest alternative is to use JNI to connect to the ICU regex libraries using the Google Android code, which is available. There do exist other languages that are at least Level-1 compliant (or better) with UTS#18, but if you want to stay within Java, ICU is currently your only real option.
