Regex: what is InCombiningDiacriticalMarks?


Question

The following code is a well-known way to convert accented characters into plain text:

Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

I replaced my "hand-made" method with this one, but I need to understand the regex part of the replaceAll call.

1) What is "InCombiningDiacriticalMarks"?
2) Where is it documented? (And where are similar properties documented?)

Thanks.

Recommended answer

\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. Unicode properties are documented in UAX #44, "Unicode Character Database".

What it means is that the code point falls within a particular range, a block, that has been allocated for the things that name describes (here U+0300 through U+036F). This is a bad approach, because there is no guarantee that a code point in that range is or is not any particular thing, nor that code points outside the block are not essentially the same kind of character.
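To make the "a block is only a range" point concrete, here is a minimal sketch. The class name BlockDemo and the helper inBlock are made up for illustration, and the code uses the lowercase block= keyword as documented for java.util.regex on Java 7 or later. It checks one combining mark inside the block and one that lives in a different block.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows that the block property is a plain range test.
final class BlockDemo {
    // The "In" prefix and the two-part keyword form name the same block.
    static final Pattern IN_FORM =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}");
    static final Pattern TWO_PART =
            Pattern.compile("\\p{block=CombiningDiacriticalMarks}");

    // True when both notations agree that the code point is in the block.
    static boolean inBlock(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return IN_FORM.matcher(s).matches() && TWO_PART.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+0301 COMBINING ACUTE ACCENT lies inside U+0300..U+036F.
        System.out.println(inBlock(0x0301));
        // U+20D7 COMBINING RIGHT ARROW ABOVE is also a combining mark,
        // but it was allocated to a different block.
        System.out.println(inBlock(0x20D7));
        System.out.println(Character.UnicodeBlock.of(0x20D7));
    }
}
```

The second code point is just as much a combining mark as the first, yet the block test rejects it.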

For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are also things in that block that are not Latin letters. And of course there are Latin letters all over the place outside it.

Blocks are almost never what you want.

In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacritical_Marks block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
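As a rough sketch of the difference, the two helpers below strip marks by block and by general category. The names MarkStripper, stripBlock and stripMn are illustrative, not from any library, and the sample input x followed by U+20D7 is just one convenient nonspacing mark that lives outside the block.

```java
import java.text.Normalizer;

// Hypothetical helper class comparing the block test with the \p{Mn} test.
final class MarkStripper {
    // The approach from the question: remove marks found in one block only.
    static String stripBlock(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // The suggested alternative: remove every nonspacing mark (category Mn).
    static String stripMn(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        String accented = "caf\u00E9";   // é decomposes to e + U+0301
        String vector = "x\u20D7";       // x + COMBINING RIGHT ARROW ABOVE
        System.out.println(stripBlock(accented) + " " + stripMn(accented));
        // The block version leaves U+20D7 in place; the Mn version removes it.
        System.out.println(stripBlock(vector).length() + " " + stripMn(vector).length());
    }
}
```

For text whose marks all fall in U+0300..U+036F, both helpers give the same result.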

That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both and you are using the default Java regex engine, you can say [\p{Mn}\p{Me}], since that engine only gives access to the General_Category property.
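A small sketch of the [\p{Mn}\p{Me}] class in use follows. AllMarks and strip are hypothetical names, and U+20DD COMBINING ENCLOSING CIRCLE is used only as a sample enclosing mark.

```java
import java.text.Normalizer;

// Hypothetical helper: removes nonspacing (Mn) and enclosing (Me) marks.
final class AllMarks {
    static String strip(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("[\\p{Mn}\\p{Me}]+", "");
    }

    public static void main(String[] args) {
        String circled = "A\u20DD"; // A + COMBINING ENCLOSING CIRCLE (category Me)
        // \p{Mn} alone does not match the enclosing mark, so nothing is removed.
        System.out.println(circled.replaceAll("\\p{Mn}+", "").length());
        // The combined class removes it.
        System.out.println(strip(circled).length());
    }
}
```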

You'd have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7, though, there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus in JDK7 you can write \p{Script=Latin} or \p{SC=Latin}, or the short form \p{Latin} (the documented short form in java.util.regex uses the Is prefix, \p{IsLatin}), to get at any character from the Latin script. This leads to the very commonly needed [\p{Latin}\p{Common}\p{Inherited}].
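Here is a minimal sketch of that character class in java.util.regex syntax, assuming Java 7 or later and using the script= keyword form. The class name ScriptDemo and the method isLatinText are invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical demo: Latin letters plus shared (Common) and Inherited characters.
final class ScriptDemo {
    static final Pattern LATIN_TEXT = Pattern.compile(
            "[\\p{script=Latin}\\p{script=Common}\\p{script=Inherited}]+");

    static boolean isLatinText(String s) {
        return LATIN_TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Letters are Latin; the space, comma, digits and "!" are Common.
        System.out.println(isLatinText("Se\u00F1or, 42!"));
        // A decomposed accent (U+0301) has the script Inherited.
        System.out.println(isLatinText("cafe\u0301"));
        // Cyrillic letters belong to neither of the three scripts.
        System.out.println(isLatinText("\u043F\u0440\u0438\u0432\u0435\u0442"));
    }
}
```

Common and Inherited are needed because punctuation, digits and combining marks are shared between scripts and do not belong to Latin itself.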

Be aware that this will not remove what you might think of as "accent" marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
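The limitation is easy to reproduce. In this sketch (StrokeLetters and strip are illustrative names) the stroke letters survive because they have no canonical decomposition, so normalization never produces a separate mark to remove.

```java
import java.text.Normalizer;

// Hypothetical demo of letters that decompose-and-strip cannot simplify.
final class StrokeLetters {
    static String strip(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // ã (U+00E3) decomposes to a + U+0303, so the tilde can be removed.
        System.out.println(strip("S\u00E3o"));
        // Đ (U+0110) and ø (U+00F8) are single code points with no
        // decomposition, so they come back unchanged.
        System.out.println(strip("\u0110").equals("\u0110"));
        System.out.println(strip("\u00F8").equals("\u00F8"));
    }
}
```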

Another place where the \p{Mn} approach fails is, of course, enclosing marks like \p{Me}, but there are also \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I'm afraid.

Oh wait, I see you are Portuguese. If you are only dealing with Portuguese text, you should have no problems at all.

However, I bet you don't really want to remove accents; rather, you want to be able to match things "accent-insensitively", right? If so, you can do that using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won't count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
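The answer refers to the ICU4J collator. To keep the example self-contained, the sketch below swaps in the JDK's own java.text.Collator, which offers the same strength setting; this is an assumption made for illustration, not the ICU4J code the author mentions, and the names AccentInsensitive and sameLetters are made up. Note that primary strength also ignores case differences.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper using the JDK collator as a stand-in for ICU4J.
final class AccentInsensitive {
    static boolean sameLetters(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // PRIMARY compares base letters only; accents and case are ignored.
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters such as ó before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Locale spanish = Locale.forLanguageTag("es");
        System.out.println(sameLetters("canci\u00F3n", "cancion", spanish));
        System.out.println(sameLetters("cancion", "cancios", spanish));
    }
}
```

Nothing is removed from the strings here; the comparison itself decides that the accent does not matter, so the original text stays intact.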
