Regex: what is InCombiningDiacriticalMarks?

Question

The following code is a well-known way to convert accented characters into plain text:

Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

I replaced my "hand-made" method with this one, but I need to understand the "regex" part of the replaceAll call.

1) What is "InCombiningDiacriticalMarks"?
2) Where is it documented (along with similar properties)?

Thanks.

Solution

\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented in UAX #44, "Unicode Character Database".
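As a rough illustration, the snippet from the question can be wrapped in a small runnable class. This is only a sketch, assuming JDK 7 or later; the class name AccentStripper and both method names are made up for this example and do not come from the original post. It shows the In-prefixed form next to the two-part Block= form, which should behave the same way.

```java
import java.text.Normalizer;

// Hypothetical helper class; the names are illustrative only.
final class AccentStripper {
    // Decompose to NFD, then delete code points in the Combining Diacritical Marks block.
    static String stripBlock(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Same operation, written with the two-part notation (JDK 7 and later).
    static String stripBlockTwoPart(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Block=CombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        String input = "a\u00E7\u00E3o"; // the Portuguese word "acao" with cedilla and tilde
        System.out.println(stripBlock(input));
        System.out.println(stripBlockTwoPart(input));
    }
}
```

Note that the backslash has to be doubled inside a Java string literal, so the regex \p{...} is written "\\p{...}" in source code.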

What it means is that the code point falls within a particular range, a block, that has been allocated for the things of that name. This is a bad approach, because there is no guarantee that a code point in that range is or is not any particular thing, nor that code points outside that block are not essentially the same kind of character.

For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.
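The mismatch between "is in the block" and "is a Latin letter" can be checked with the JDK's Character API. The following is a minimal sketch, assuming JDK 7 or later; the class and method names are hypothetical, and the sample code points (é, ×, a, œ) were picked for this example rather than taken from the original answer.

```java
// Hypothetical demo class; names are illustrative only.
final class BlockVersusContent {
    // True when the code point lies in the Latin-1 Supplement block (U+0080..U+00FF).
    static boolean inLatin1Supplement(int codePoint) {
        return Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.LATIN_1_SUPPLEMENT;
    }

    // True when the code point is a letter that belongs to the Latin script.
    static boolean isLatinLetter(int codePoint) {
        return Character.isLetter(codePoint)
                && Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.LATIN;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute), U+00D7 (multiplication sign), U+0061 (a), U+0153 (oe ligature)
        int[] samples = {0x00E9, 0x00D7, 0x0061, 0x0153};
        for (int cp : samples) {
            System.out.printf("U+%04X inBlock=%b latinLetter=%b%n",
                    cp, inLatin1Supplement(cp), isLatinLetter(cp));
        }
    }
}
```

The multiplication sign sits inside the block without being a letter, while a and œ are Latin letters that live in other blocks.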

Blocks are nearly never what you want.

In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacritical_Marks block (U+0300 to U+036F) are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
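To make the difference concrete, here is a small sketch comparing the block-based pattern with \p{Mn}. The class and method names are hypothetical. The test character U+20D7 (COMBINING RIGHT ARROW ABOVE) is an example chosen here, not one from the original answer: it is a nonspacing mark that lives in the Combining Diacritical Marks for Symbols block, so the block-based pattern misses it.

```java
// Hypothetical demo class; names are illustrative only.
final class MarkCategoryDemo {
    // Removes only marks in the Combining Diacritical Marks block (U+0300..U+036F).
    static String stripByBlock(String s) {
        return s.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Removes every code point whose General_Category is Mn (Nonspacing_Mark).
    static String stripByCategory(String s) {
        return s.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        String acute = "e\u0301"; // e + COMBINING ACUTE ACCENT (inside the block)
        String arrow = "a\u20D7"; // a + COMBINING RIGHT ARROW ABOVE (outside the block)
        System.out.println(stripByBlock(acute).length());    // mark removed
        System.out.println(stripByBlock(arrow).length());    // mark kept
        System.out.println(stripByCategory(arrow).length()); // mark removed
    }
}
```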

That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
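A short sketch of the [\p{Mn}\p{Me}] character class in the stock Java regex engine follows. The class and method names are hypothetical, and U+20DD (COMBINING ENCLOSING CIRCLE) is used here as an assumed example of an enclosing mark.

```java
// Hypothetical demo class; names are illustrative only.
final class EnclosingMarkDemo {
    // Removes nonspacing marks only.
    static String stripNonspacing(String s) {
        return s.replaceAll("\\p{Mn}+", "");
    }

    // Removes nonspacing marks and enclosing marks.
    static String stripNonspacingAndEnclosing(String s) {
        return s.replaceAll("[\\p{Mn}\\p{Me}]+", "");
    }

    public static void main(String[] args) {
        String circled = "A\u20DD"; // A + COMBINING ENCLOSING CIRCLE (category Me)
        System.out.println(stripNonspacing(circled).length());             // circle kept
        System.out.println(stripNonspacingAndEnclosing(circled).length()); // circle removed
    }
}
```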

You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short-cut \p{IsLatin} (Java requires the Is prefix; Perl and ICU spell it \p{Latin}), to get at any character from the Latin script. This leads to the very commonly needed [\p{IsLatin}\p{IsCommon}\p{IsInherited}].
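The script property can be tried out with java.util.regex on JDK 7 or later. This is a sketch under that assumption. The class, field, and method names are hypothetical. Note that Java's own short form for a script uses an Is prefix (\p{IsLatin}); an unprefixed \p{Latin} is Perl and ICU syntax and is rejected by Java's Pattern.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class ScriptPropertyDemo {
    // Three spellings of the Latin script property accepted by java.util.regex.
    static final Pattern LATIN_IS = Pattern.compile("\\p{IsLatin}");
    static final Pattern LATIN_SCRIPT = Pattern.compile("\\p{Script=Latin}");
    static final Pattern LATIN_SC = Pattern.compile("\\p{sc=Latin}");

    // Latin letters plus script-neutral characters (digits, spaces, punctuation)
    // plus characters that inherit their script, such as combining accents.
    static final Pattern LATIN_TEXT =
            Pattern.compile("[\\p{IsLatin}\\p{IsCommon}\\p{IsInherited}]+");

    static boolean isLatinText(String s) {
        return LATIN_TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLatinText("cafe\u0301 42"));    // Latin, Inherited, Common
        System.out.println(isLatinText("caf\u00E9 \u044F")); // contains a Cyrillic letter
    }
}
```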

Be aware that this will not remove what you might think of as "accent" marks from all characters! There are many for which it does nothing. For example, you cannot convert Đ to D or ø to o that way, because those characters have no canonical decomposition into a base letter plus a combining mark. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
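The limitation is easy to reproduce. In this sketch (hypothetical class and method names), é decomposes under NFD into e plus a combining acute accent, so the mark can be stripped, whereas Đ (U+0110) and ø (U+00F8) have no canonical decomposition and pass through unchanged.

```java
import java.text.Normalizer;

// Hypothetical demo class; names are illustrative only.
final class DecompositionLimits {
    // Decompose to NFD, then remove all nonspacing marks.
    static String stripMarks(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripMarks("\u00E9").equals("e"));      // true: mark removed
        System.out.println(stripMarks("\u0110").equals("\u0110")); // true: D with stroke unchanged
        System.out.println(stripMarks("\u00F8").equals("\u00F8")); // true: o with stroke unchanged
    }
}
```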

Another place where the \p{Mn} approach fails is of course enclosing marks like \p{Me}, obviously, but there are also \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.

Oh wait, I see you are Portuguese. You should have no problems at all then if you are only dealing with Portuguese text.

However, you don’t really want to remove accents, I bet, but rather you want to be able to match things "accent-insensitively", right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
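The answer recommends ICU4J's Collator. So that the sketch below runs without any extra dependency, it swaps in the JDK's own java.text.Collator instead, which also offers a PRIMARY strength setting; this is a substitution made for this example, not the author's ICU4J code. The class and method names, and the sample words, are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; uses java.text.Collator as a stand-in for ICU4J.
final class AccentInsensitiveCompare {
    // Compares two strings at primary strength, where accent differences
    // (and, as a side effect, case differences) are not significant.
    static boolean sameIgnoringAccents(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        // Treat precomposed and decomposed forms of the same text alike.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        Locale spanish = Locale.forLanguageTag("es");
        System.out.println(sameIgnoringAccents("cancion", "canci\u00F3n", spanish));
        System.out.println(sameIgnoringAccents("cancion", "cantar", spanish));
    }
}
```

Because the comparison leaves the strings untouched, the original accented text stays available for display, which is the main advantage over stripping marks up front.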
