Java Regex to match Vietnamese chars

Question

I have to write a regex to restrict an input field so that it only allows Vietnamese characters, English characters and digits. I know how to restrict English chars ([a-zA-Z]) and digits ([0-9]), but I don't know how to restrict Vietnamese chars.

Can anyone give me a java regex to match Vietnamese chars?

Vietnamese chars look like: ể, ứ (but I don't know all of them; otherwise I could use [a-list-of-chars], or maybe there is a range, like [a-d] instead of [abcd]).

Recommended answer

Vietnamese Alphabet

The intersection of the Vietnamese alphabet and English alphabet (i.e. whatever that is common between the 2 alphabets) is the alphabet of English minus f, j, w and z.

In Vietnamese, a, e, i, o, u, y are considered vowels.

Apart from those, Vietnamese also uses several other characters with diacritics. Listed below are the uppercase forms (each lowercase form maps 1-to-1 to a single character, unlike ß in German):


  • Consonant:

Đ: LATIN CAPITAL LETTER D WITH STROKE


  • Vowels:

    Ă: LATIN CAPITAL LETTER A WITH BREVE
    Â: LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    Ê: LATIN CAPITAL LETTER E WITH CIRCUMFLEX
    Ô: LATIN CAPITAL LETTER O WITH CIRCUMFLEX
    Ơ: LATIN CAPITAL LETTER O WITH HORN
    Ư: LATIN CAPITAL LETTER U WITH HORN
    


    Vietnamese has 6 tones. Except for the first tone, the other 5 tones are indicated by an additional diacritic on the vowel. The tonal diacritics are acute á, grave à, hook ả, tilde ã and dot below ạ. Since there are (6 + 6) vowels times 5 tonal diacritics, plus 6 vowels that already carry a diacritic in the first tone, there are 66 vowel glyphs with diacritic(s):

    Here is the list of all (67) consonants and vowels with diacritic(s):

      Á À Ã Ả Ạ
    Ă Ắ Ằ Ẳ Ẵ Ặ
    Â Ấ Ầ Ẩ Ẫ Ậ
    Đ
      É È Ẻ Ẽ Ẹ
    Ê Ế Ề Ể Ễ Ệ 
      Í Ì Ỉ Ĩ Ị
    Ô Ố Ồ Ổ Ỗ Ộ
    Ơ Ớ Ờ Ở Ỡ Ợ
      Ó Ò Õ Ỏ Ọ
    Ư Ứ Ừ Ử Ữ Ự
      Ú Ù Ủ Ũ Ụ
      Ý Ỳ Ỷ Ỹ Ỵ
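
    The count of 66 vowel glyphs can be cross-checked mechanically. The following is only a sketch, not part of the original answer: it assumes the 12 base vowels and the 5 tonal combining marks described above, composes each pair with java.text.Normalizer (NFC), and collects the results. The class name VowelGlyphs and the method allGlyphs are made up for illustration.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

class VowelGlyphs {
    // 6 plain vowels plus the 6 vowels that carry a diacritic in the first tone:
    // A-breve (U+0102), A-circumflex (U+00C2), E-circumflex (U+00CA),
    // O-circumflex (U+00D4), O-horn (U+01A0), U-horn (U+01AF).
    static final String BASE_VOWELS = "AEIOUY\u0102\u00C2\u00CA\u00D4\u01A0\u01AF";

    // Combining acute, grave, hook above, tilde, dot below.
    static final String TONE_MARKS = "\u0301\u0300\u0309\u0303\u0323";

    static Set<String> allGlyphs() {
        Set<String> glyphs = new LinkedHashSet<>();
        for (char base : BASE_VOWELS.toCharArray()) {
            // The 6 non-ASCII base vowels count on their own (first tone).
            if (base > 0x7F) {
                glyphs.add(String.valueOf(base));
            }
            for (char tone : TONE_MARKS.toCharArray()) {
                // NFC composes base + combining mark into the precomposed character.
                glyphs.add(Normalizer.normalize("" + base + tone, Normalizer.Form.NFC));
            }
        }
        return glyphs;
    }

    public static void main(String[] args) {
        Set<String> glyphs = allGlyphs();
        System.out.println(glyphs.size() + " glyphs: " + String.join(" ", glyphs));
    }
}
```

    Adding Đ to this set gives the 67 characters listed above.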
    

    These characters are spread across different Latin blocks in Unicode. I handpicked these characters from Character Map, and I had to be careful not to pick characters that are visually identical to the ones above. To be sure, we can print the names of the characters and check that they are Latin characters rather than Greek or Cyrillic.

    String VIETNAMESE_DIACRITIC_CHARACTERS = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
    
    for (char c: VIETNAMESE_DIACRITIC_CHARACTERS.toCharArray()) {
        System.out.println(c + ": " + Character.getName(c));
    }
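
    Instead of reading the printed names by eye, the same check could be automated. The sketch below is an illustration rather than part of the original answer: it uses Character.UnicodeScript (available since Java 7) to find the first code point that is not in the Latin script. The class LatinCheck and the method firstNonLatin are hypothetical names.

```java
class LatinCheck {
    // Returns the index of the first code point whose Unicode script is not LATIN,
    // or -1 if every code point is Latin.
    static int firstNonLatin(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.LATIN) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        // U+0110 is D with stroke, U+1EE2 is O with horn and dot below: both Latin.
        System.out.println(firstNonLatin("\u0110\u1EE2A"));
        // U+0391 is GREEK CAPITAL LETTER ALPHA, which looks like a Latin A.
        System.out.println(firstNonLatin("\u0110\u0391"));
    }
}
```

    Passing VIETNAMESE_DIACRITIC_CHARACTERS to firstNonLatin should return -1 if no lookalike slipped into the list.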
    



    Combining character

    Vietnamese input methods such as Unikey have 2 modes: single code point mode ("Unicode dựng sẵn") and combining mark mode ("Unicode tổ hợp").

    As an example, for the same character ợ (U+1EE3), there can be several ways to specify it:

    • As a single code point (1 code point): ợ
    • As a combination of ơ (U+01A1) and combining dot below (U+0323) (2 code points): ợ
    • As a combination of o, combining hook (U+031B), and combining dot below (U+0323) (3 code points): ợ

    You can copy these characters into the browser console and check their lengths:

    ["ợ","ợ","ợ"].forEach(function (e) {console.log(e.length);})
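
    The same check can be written in Java. This is a small sketch that spells the three forms with Unicode escapes, so the result does not depend on how the text was copied; the class and constant names are made up for illustration.

```java
class CodePointForms {
    static final String PRECOMPOSED = "\u1EE3";          // single code point
    static final String HORN_PLUS_DOT = "\u01A1\u0323";  // o with horn, then combining dot below
    static final String FULLY_SPLIT = "o\u031B\u0323";   // o, combining horn, combining dot below

    public static void main(String[] args) {
        for (String s : new String[] {PRECOMPOSED, HORN_PLUS_DOT, FULLY_SPLIT}) {
            // All three render as the same letter but have lengths 1, 2 and 3.
            System.out.println(s + " -> " + s.length() + " char(s)");
        }
    }
}
```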
    

    If you want to match all 3 variations above, you must list every possible combination and permutation that specifies the character, and you would have to do this for every character with diacritics listed above, in both uppercase and lowercase.

    Simple enough?

    Even if you answer yes, your code will become an unmaintainable mess that no one can understand.

    Since there is more than one way to specify the same text, two differently encoded ợ cannot compare as equal without some transformation.

    "ợ".equals("ợ") --> false
    

    The Unicode Standard therefore defines all 3 ways of specifying ợ above as canonically equivalent, and also defines methods to normalize a string for comparison purposes.
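
    In Java, normalization is available through java.text.Normalizer. A minimal sketch of comparing strings under canonical equivalence might look like this; the helper canonicallyEqual is a hypothetical name, and choosing NFC over NFD here is arbitrary, since either form works as long as both sides use the same one.

```java
import java.text.Normalizer;

class CanonicalEquality {
    // Compares two strings after bringing both to the same normalization form (NFC here).
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u1EE3";       // 1 code point
        String decomposed = "o\u031B\u0323"; // 3 code points
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```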

    The reference implementation of the Pattern class (by Oracle, widely used on Windows and other platforms) has (partial) support for canonical equivalence matching through the Pattern.CANON_EQ mode. It is extremely buggy, to the point of being unusable, as documented in more than one bug report. At the time of writing, this bug has been present in every version since CANON_EQ has been "supported", and it is not likely to be fixed any time soon. However, it is not totally broken, and we can still make use of whatever the option currently offers.

    Below is the construction of the Pattern for matching the Vietnamese + English alphabet:

    String VIETNAMESE_DIACRITIC_CHARACTERS 
            = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
    
    Pattern p =
        Pattern.compile("(?:[" + VIETNAMESE_DIACRITIC_CHARACTERS + "]|[A-Z])++",
                        Pattern.CANON_EQ |
                        Pattern.CASE_INSENSITIVE |
                        Pattern.UNICODE_CASE);
    

    The additional flags Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE make the pattern match case-insensitively for all Unicode characters. Pattern.CASE_INSENSITIVE alone only makes the pattern match case-insensitively for characters in the US-ASCII charset.
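
    The difference between the two flag combinations can be seen on a single letter. The sketch below is illustrative only (the class CaseFlags and its helper are made-up names); it tries to match lowercase đ against a pattern containing uppercase Đ.

```java
import java.util.regex.Pattern;

class CaseFlags {
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String upper = "\u0110"; // D with stroke, uppercase
        String lower = "\u0111"; // d with stroke, lowercase

        // ASCII-only case folding: the non-ASCII letter is not folded.
        System.out.println(matches(upper, Pattern.CASE_INSENSITIVE, lower)); // false
        // Unicode-aware case folding.
        System.out.println(matches(upper, Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, lower)); // true
        // ASCII letters fold with CASE_INSENSITIVE alone.
        System.out.println(matches("D", Pattern.CASE_INSENSITIVE, "d")); // true
    }
}
```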

    Note that the order of the characters in VIETNAMESE_DIACRITIC_CHARACTERS is significant. I don't recommend changing the order of the characters unless you understand the implication.

    The input should be normalized with Canonical Decomposition (NFD) or Canonical Composition (NFC) before matching is performed on it. This ensures that the combining marks are in canonical order.
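
    If the input is always normalized to NFC first, a more conservative variant becomes possible. Note that this is a different technique from the CANON_EQ pattern in this answer, offered only as a sketch: it normalizes the input to NFC and then matches against a plain character class, so it does not rely on CANON_EQ at all. The class NormalizedMatch, the method isAllowed, and the decision to also accept digits (as the question asks) are assumptions for illustration.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

class NormalizedMatch {
    // Same list as in the answer; NFC-normalized defensively in case this source
    // file was saved in decomposed form.
    static final String VIETNAMESE_DIACRITIC_CHARACTERS = Normalizer.normalize(
            "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ",
            Normalizer.Form.NFC);

    // Upper- and lowercase are listed explicitly, so no case flags are needed.
    static final Pattern ALLOWED = Pattern.compile(
            "[A-Za-z0-9"
            + VIETNAMESE_DIACRITIC_CHARACTERS
            + VIETNAMESE_DIACRITIC_CHARACTERS.toLowerCase(Locale.ROOT)
            + "]+");

    // True if the whole input consists of English letters, digits and Vietnamese letters.
    static boolean isAllowed(String input) {
        String nfc = Normalizer.normalize(input, Normalizer.Form.NFC);
        return ALLOWED.matcher(nfc).matches();
    }

    public static void main(String[] args) {
        // "Ba" + combining dot below + "n123": decomposed input, composed by NFC first.
        System.out.println(isAllowed("Ba\u0323n123"));
        // Punctuation is not in the class, so this is rejected.
        System.out.println(isAllowed("Ba\u0323n!"));
    }
}
```

    The trade-off is that any combining sequence without a precomposed form in the list is rejected, which is arguably what an input restriction wants anyway.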

    Regardless of whether the input is preprocessed with Canonical Composition or Canonical Decomposition, the result looks the same. Running the code in the appendix should return visually identical results for the second and the third output:


    Bạn chính là tác giả của Wikipedia Mọi người đều có thể biên tập bài ngay lập tức chỉ cần nhớ vài quy tắc Có sẵn rất nhiều trang trợ giúp như tạo bài sửa bài hay tải ảnh Bạn cũng đừng ngại đặt câu hỏi Hiện chúng ta có bài viết và thành viên

    Bạn chính là tác giả của Wikipedia Mọi người đều có thể biên tập bài ngay lập tức chỉ cần nhớ vài quy tắc Có sẵn rất nhiều trang trợ giúp như tạo bài sửa bài hay tải ảnh Bạn cũng đừng ngại đặt câu hỏi Hiện chúng ta có bài viết và thành viên



    Failed attempts

    Here are some failed attempts, which will be used to explain why the regex is constructed as shown above.

    Attempt 1

    String VIETNAMESE_DIACRITIC_CHARACTERS 
            = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
    
    Pattern p =
        Pattern.compile("[A-Z" + VIETNAMESE_DIACRITIC_CHARACTERS + "]++",
                        Pattern.CANON_EQ |
                        Pattern.CASE_INSENSITIVE |
                        Pattern.UNICODE_CASE);
    

    Why don't we include A-Z in the same character class, instead of putting it in a separate character class and alternating it with the diacritic character class?

    Nope, the result is broken when we try to match on the Canonical Decomposition of the input string. The diacritics are not matched at all.


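    Ba n chi nh la ta c gia cu a Wikipedia Mo i ngu o i đe u co the bie n ta p ba i ngay la p tu c chi ca n nho va i quy ta c Co sa n ra t nhie u trang tro giu p nhu ta o ba i su a ba i hay ta i a nh Ba n cu ng đu ng nga i đa t ca u ho i Hie n chu ng ta co ba i vie t va tha nh vie n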



    Attempt 2

    String VIETNAMESE_DIACRITIC_CHARACTERS 
            = "ÁÀÃẢẠĂẮẰẲẴẶÂẤẦẨẪẬĐÉÈẺẼẸÊẾỀỂỄỆÍÌỈĨỊÓÒÕỎỌÔỐỒỔỖỘƠỚỜỞỠỢÚÙỦŨỤƯỨỪỬỮỰÝỲỶỸỴ";
    
    Pattern p =
        Pattern.compile("(?:[" + VIETNAMESE_DIACRITIC_CHARACTERS + "]|[A-Z])++",
                        Pattern.CANON_EQ |
                        Pattern.CASE_INSENSITIVE |
                        Pattern.UNICODE_CASE);
    

    The diacritic characters are declared inside a character class, so the code should behave the same if I change the order of the characters... right?

    Nope, some results are broken when we try to match on the Canonical Decomposition of the input string.

    Bạn chính là tác giả của Wikipedia Mọi ngươ i đê u có thê biên tạ p bài ngay lạ p tư c chỉ câ n nhơ vài quy tă c Có să n râ t nhiê u trang trơ giúp như tạo bài sư a bài hay tải ảnh Bạn cũng đư ng ngại đạ t câu hỏi Hiẹ n chúng ta có bài viê t và thành viên



    Explanation

    The reference implementation (Oracle) implements Pattern.CANON_EQ mode by picking out characters in the expression that can be expanded into multiple characters under Canonical Decomposition and performing a textual transformation of the regex. The transformed expression is then compiled as normal.

    The first pass that transforms the regex doesn't parse the expression properly, so it exhibits crazy behavior even for very simple matching, as seen in the bug reports above.

    Fortunately, the Pattern class spits out the regex after the transformation if there is an unmatched ( in the regex. Therefore, we can just add ( at the end to trigger a PatternSyntaxException and look at the transformed regex string.
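
    The trick can be wrapped in a small helper. This is a sketch with made-up names (TransformedRegexDump, dump); it relies only on PatternSyntaxException.getPattern(), and what that string looks like under CANON_EQ depends on the JDK version, so the output on a current JDK may differ from the dumps shown in this answer.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class TransformedRegexDump {
    // Appends an unmatched "(" so that compilation fails, then returns the pattern
    // text carried by the exception. Returns null if compilation unexpectedly succeeds.
    static String dump(String regex, int flags) {
        try {
            Pattern.compile(regex + "(", flags);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getPattern();
        }
    }

    public static void main(String[] args) {
        // U+1EE2 decomposes into O + combining horn + combining dot below.
        System.out.println(dump("(?:[\u1EE2]|[A-Z])++", Pattern.CANON_EQ));
    }
}
```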

    Let's mess up the solution regex above and see what regex string enters the compilation step:

    java.util.regex.PatternSyntaxException: Unclosed group near index 596
    (?:(?:[Đ]|Ắ|Ắ|Ắ|Ằ|Ằ|Ằ|Ẳ|Ẳ|Ẳ|Ẵ|Ẵ|Ẵ|Ặ|Ặ|Ặ|Ặ|Ặ|Ặ|Ă|Ă|Ấ|Ấ|Ấ|Ầ|Ầ|Ầ|Ẩ|Ẩ|Ẩ|Ẫ|Ẫ|Ẫ|Ậ|Ậ|Ậ|Ậ|Ậ|Ậ|Â|Â|Á|Á|À|À|Ã|Ã|Ả|Ả|Ạ|Ạ|Ế|Ế|Ế|Ề|Ề|Ề|Ể|Ể|Ể|Ễ|Ễ|Ễ|Ệ|Ệ|Ệ|Ệ|Ệ|Ệ|Ê|Ê|É|É|È|È|Ẻ|Ẻ|Ẽ|Ẽ|Ẹ|Ẹ|Í|Í|Ì|Ì|Ỉ|Ỉ|Ĩ|Ĩ|Ị|Ị|Ố|Ố|Ố|Ồ|Ồ|Ồ|Ổ|Ổ|Ổ|Ỗ|Ỗ|Ỗ|Ộ|Ộ|Ộ|Ộ|Ộ|Ộ|Ô|Ô|Ớ|Ớ|Ớ|Ớ|Ớ|Ớ|Ờ|Ờ|Ờ|Ờ|Ờ|Ờ|Ở|Ở|Ở|Ở|Ở|Ở|Ỡ|Ỡ|Ỡ|Ỡ|Ỡ|Ỡ|Ợ|Ợ|Ợ|Ợ|Ợ|Ợ|Ơ|Ơ|Ó|Ó|Ò|Ò|Õ|Õ|Ỏ|Ỏ|Ọ|Ọ|Ứ|Ứ|Ứ|Ứ|Ứ|Ứ|Ừ|Ừ|Ừ|Ừ|Ừ|Ừ|Ử|Ử|Ử|Ử|Ử|Ử|Ữ|Ữ|Ữ|Ữ|Ữ|Ữ|Ự|Ự|Ự|Ự|Ự|Ự|Ư|Ư|Ú|Ú|Ù|Ù|Ủ|Ủ|Ũ|Ũ|Ụ|Ụ|Ý|Ý|Ỳ|Ỳ|Ỷ|Ỷ|Ỹ|Ỹ|Ỵ|Ỵ)|[A-Z])++(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
    

    As we can see, the engine grabs all the characters that can be expanded under Canonical Decomposition, takes them out of the character class and builds an alternation.

    It is still not very clear what is happening with the same characters repeating in the alternation, so I will insert a space between every character:


    (? :(?:[Đ] | Ă|Ă|Ắ|Ă|Ă|Ằ| Ẳ|Ẳ|Ẳ|Ă|Ă|Ẵ| Ặ|Ặ |Ặ| Ặ|Ặ|Ặ|Ă|Ă| A |Â|Ấ| A |Â|Ầ| Ả|̉| |Ẩ| A |Â |Ẫ|Ạ|Ạ|Ậ|Ạ|̣| |Ậ| A |Â| A |Á| A |À| A |Ã|Ả|Ả|Ạ|Ạ | E |Ê|Ế| E |Ê|Ề| E | |Ể|Ể| E |Ê|Ễ|Ẹ|Ẹ|Ệ|Ẹ| Ệ|Ệ| E |Ê| E |É| E |È| Ẻ|Ẻ| E |Ẽ| Ẹ|Ẹ| I |Í| I |Ì|̉|Ỉ| I |Ĩ|Ị|Ị| O |Ô|Ố| O |Ô|Ồ| O | |Ổ|Ổ| O |Ô|Ỗ|Ọ| Ọ|Ộ|Ọ|Ộ|Ộ| O |Ô|Ơ|Ơ|Ớ|Ơ|Ớ|Ớ|Ơ|Ơ|Ờ|Ơ| Ờ|Ờ| Ở|Ở|Ở| Ở|Ở|Ở| Ơ|Ơ|Ỡ| Ơ|Ỡ|Ỡ| Ợ|Ợ|Ợ| Ợ|Ợ|Ợ| Ơ|Ơ| O |Ó| O |Ò| O |Õ| O | | Ỏ|Ọ|Ọ|Ư|Ư|Ứ|Ư|Ứ|Ứ| Ư|Ư|Ừ| Ư|Ừ|Ừ| Ử|Ử| Ử| Ử|Ử|Ử| Ư|Ư|Ữ| ̛| ŨŨ| Ữ| Ự| ƯƯ| Ự| Ự| ỤỤ| Ự| ̛| Ư| U | Ú| U | Ù| ̉| Ủ| U | Ũ| ̣| Ụ| Y | Ý| Y | Ỳ| Ỷ| Ỷ| Y | Ỹ| Ỵ| Ỵ)| [A - Z])+ +(

    We can see that the repeated, seemingly identical characters are not really the same: they are different sequences that represent the same character.

    With the same method, let us analyze the regex in attempt 2 to see why it fails.

    java.util.regex.PatternSyntaxException: Unclosed group near index 596
    (?:(?:[Đ]|Á|Á|À|À|Ã|Ã|Ả|Ả|Ạ|Ạ|Ă|Ă|Ắ|Ắ|Ắ|Ằ|Ằ|Ằ|Ẳ|Ẳ|Ẳ|Ẵ|Ẵ|Ẵ|Ặ|Ặ|Ặ|Ặ|Ặ|Ặ|Â|Â|Ấ|Ấ|Ấ|Ầ|Ầ|Ầ|Ẩ|Ẩ|Ẩ|Ẫ|Ẫ|Ẫ|Ậ|Ậ|Ậ|Ậ|Ậ|Ậ|É|É|È|È|Ẻ|Ẻ|Ẽ|Ẽ|Ẹ|Ẹ|Ê|Ê|Ế|Ế|Ế|Ề|Ề|Ề|Ể|Ể|Ể|Ễ|Ễ|Ễ|Ệ|Ệ|Ệ|Ệ|Ệ|Ệ|Í|Í|Ì|Ì|Ỉ|Ỉ|Ĩ|Ĩ|Ị|Ị|Ó|Ó|Ò|Ò|Õ|Õ|Ỏ|Ỏ|Ọ|Ọ|Ô|Ô|Ố|Ố|Ố|Ồ|Ồ|Ồ|Ổ|Ổ|Ổ|Ỗ|Ỗ|Ỗ|Ộ|Ộ|Ộ|Ộ|Ộ|Ộ|Ơ|Ơ|Ớ|Ớ|Ớ|Ớ|Ớ|Ớ|Ờ|Ờ|Ờ|Ờ|Ờ|Ờ|Ở|Ở|Ở|Ở|Ở|Ở|Ỡ|Ỡ|Ỡ|Ỡ|Ỡ|Ỡ|Ợ|Ợ|Ợ|Ợ|Ợ|Ợ|Ú|Ú|Ù|Ù|Ủ|Ủ|Ũ|Ũ|Ụ|Ụ|Ư|Ư|Ứ|Ứ|Ứ|Ứ|Ứ|Ứ|Ừ|Ừ|Ừ|Ừ|Ừ|Ừ|Ử|Ử|Ử|Ử|Ử|Ử|Ữ|Ữ|Ữ|Ữ|Ữ|Ữ|Ự|Ự|Ự|Ự|Ự|Ự|Ý|Ý|Ỳ|Ỳ|Ỷ|Ỷ|Ỹ|Ỹ|Ỵ|Ỵ)|[A-Z])++(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        ^
    

    With a space inserted between every character:


    (? :(?:[Đ] | A |Á| A |À| A |Ã|Ả|Ả|Ạ|Ạ|Ă|Ă|Ă|Ă|Ắ|Ă |Ă|Ằ| Ẳ|Ẳ|Ẳ|Ă|Ă|Ẵ| Ặ|Ặ|Ặ| Ặ|Ặ|Ặ| A |Â| A |Â|Ấ| A |Â|Ầ|Ả|̉| |Ẩ| A |Â|Ẫ|Ạ|Ạ|Ậ|Ạ|̣| |Ậ | E |É| E |È|Ẻ|Ẻ| E |Ẽ|Ẹ|Ẹ| E |Ê| E |Ê|Ế| E |Ê|Ề| E ̉|Ể|Ể| E |Ê|Ễ|Ẹ|Ẹ|Ệ|Ẹ|Ệ|Ệ| I |Í| I |Ì| Ỉ|Ỉ| I |Ĩ|Ị|Ị| O |Ó| O |Ò| O |Õ|Ỏ|Ỏ| O | |Ọ| O |Ô| O |Ô|Ố| O |Ô|Ồ|Ỏ|Ổ|Ổ| O |Ô|Ỗ|Ọ|Ọ|Ộ|Ọ|Ộ|Ộ|Ơ|Ơ| O ̛|Ơ|Ớ| O | |Ó| |Ớ| O | |Ơ|Ờ| O | |Ờ|Ờ | Ở|Ở|Ở| Ở|Ở|Ở| Ơ|Ơ|Ỡ| Ơ|Ỡ|Ỡ| Ợ|Ợ|Ợ| Ợ| Ợ|Ợ| U |Ú| U |Ù| Ủ|Ủ| U |Ũ| Ụ|Ụ| Ư|Ư| Ư|Ư|Ứ| Ư|Ứ |Ứ| U | |Ư|Ừ|Ư|ÙÙ| Ừ| Ử| ƯƯ| Ử| Ử| ỦỦ| Ử| ̛| Ư| Ữ| ̛| ŨŨ| Ữ| Ự| ƯƯ| Ự| Ự| ỤỤ| Ự| Y | Ý| Y | Ỳ| Ỷ| Ỷ| Y | Ỹ| Ỵ| Ỵ)| [A - Z])+ +(

    Notice that A ̂ | Â comes before A ̂ ̀ | Â ̀ | Ầ in the regex. This means that A ̂ will be tried first on the input Ầ (A ̂ ̀), and the repetition will end when it fails to match anything in the next iteration.

    Since the order of the alternation matters, as a general rule, when one of 2 strings is a prefix of the other, the longer string should go first in the alternation. In our case, we need to place the characters with more diacritics before the characters with fewer or no diacritics.
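
    The prefix rule can be reproduced without CANON_EQ by writing the decomposed sequences directly into the alternation. The following sketch (class and method names are illustrative) matches the decomposed form of Ầ against both orderings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Length (in chars) of the match found at the start of the input, or -1 if none.
    static int prefixMatchLength(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String shorter = "A\u0302";       // A + combining circumflex
        String longer = "A\u0302\u0300";  // A + combining circumflex + combining grave
        String input = longer;

        // Shorter alternative first: it wins, and the stray grave accent ends the repetition.
        System.out.println(prefixMatchLength("(?:" + shorter + "|" + longer + ")++", input)); // 2
        // Longer alternative first: the whole sequence is consumed.
        System.out.println(prefixMatchLength("(?:" + longer + "|" + shorter + ")++", input)); // 3
    }
}
```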

    The same problem occurs in attempt 1:

    java.util.regex.PatternSyntaxException: Unclosed group near index 589
    (?:[A-ZĐ]|Ắ|Ắ|Ắ|Ằ|Ằ|Ằ|Ẳ|Ẳ|Ẳ|Ẵ|Ẵ|Ẵ|Ặ|Ặ|Ặ|Ặ|Ặ|Ặ|Ă|Ă|Ấ|Ấ|Ấ|Ầ|Ầ|Ầ|Ẩ|Ẩ|Ẩ|Ẫ|Ẫ|Ẫ|Ậ|Ậ|Ậ|Ậ|Ậ|Ậ|Â|Â|Á|Á|À|À|Ã|Ã|Ả|Ả|Ạ|Ạ|Ế|Ế|Ế|Ề|Ề|Ề|Ể|Ể|Ể|Ễ|Ễ|Ễ|Ệ|Ệ|Ệ|Ệ|Ệ|Ệ|Ê|Ê|É|É|È|È|Ẻ|Ẻ|Ẽ|Ẽ|Ẹ|Ẹ|Í|Í|Ì|Ì|Ỉ|Ỉ|Ĩ|Ĩ|Ị|Ị|Ố|Ố|Ố|Ồ|Ồ|Ồ|Ổ|Ổ|Ổ|Ỗ|Ỗ|Ỗ|Ộ|Ộ|Ộ|Ộ|Ộ|Ộ|Ô|Ô|Ớ|Ớ|Ớ|Ớ|Ớ|Ớ|Ờ|Ờ|Ờ|Ờ|Ờ|Ờ|Ở|Ở|Ở|Ở|Ở|Ở|Ỡ|Ỡ|Ỡ|Ỡ|Ỡ|Ỡ|Ợ|Ợ|Ợ|Ợ|Ợ|Ợ|Ơ|Ơ|Ó|Ó|Ò|Ò|Õ|Õ|Ỏ|Ỏ|Ọ|Ọ|Ứ|Ứ|Ứ|Ứ|Ứ|Ứ|Ừ|Ừ|Ừ|Ừ|Ừ|Ừ|Ử|Ử|Ử|Ử|Ử|Ử|Ữ|Ữ|Ữ|Ữ|Ữ|Ữ|Ự|Ự|Ự|Ự|Ự|Ự|Ư|Ư|Ú|Ú|Ù|Ù|Ủ|Ủ|Ũ|Ũ|Ụ|Ụ|Ý|Ý|Ỳ|Ỳ|Ỷ|Ỷ|Ỹ|Ỹ|Ỵ|Ỵ)++(
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
    

    Since the alternations are formed after the original character class, the vowels in [A-Z] will be tried first, leading to the repetition terminating early when it encounters a stray combining mark.

    Appendix

    Below is the source code of the testing program.

    Demo on ideone

    import java.util.regex.*;
    import java.text.*;
    
    class Ideone
    {
        public static void main (String[] args) throws java.lang.Exception
        {
            String VIETNAMESE_DIACRITIC_CHARACTERS 
                = "ẮẰẲẴẶĂẤẦẨẪẬÂÁÀÃẢẠĐẾỀỂỄỆÊÉÈẺẼẸÍÌỈĨỊỐỒỔỖỘÔỚỜỞỠỢƠÓÒÕỎỌỨỪỬỮỰƯÚÙỦŨỤÝỲỶỸỴ";
            /*
            for (char c: VIETNAMESE_DIACRITIC_CHARACTERS.toCharArray()) {
                System.out.println(c + ": " + Character.getName(c));
            }
            */
    
            String tests[] = new String[3];
            tests[0] = 
                "Bạn chính là tác giả của Wikipedia!\n" + 
                "Mọi người đều có thể biên tập bài ngay lập tức, chỉ cần nhớ vài quy tắc." +
                "Có sẵn rất nhiều trang trợ giúp như tạo bài, sửa bài hay tải ảnh." + 
                "Bạn cũng đừng ngại đặt câu hỏi.\n" +
                "Hiện chúng ta có 1.109.446 bài viết và 406.782 thành viên.";
    
            tests[1] =
                Normalizer.normalize(tests[0], Normalizer.Form.NFD);
            /*
            for (char c: tests[1].toCharArray()) {
                System.out.printf("%04x ", (int) c);
            }
            */  
            tests[2] =
                Normalizer.normalize(tests[0], Normalizer.Form.NFC);
    
            try {
                Pattern p = Pattern.compile("(?:[" + VIETNAMESE_DIACRITIC_CHARACTERS + "]|[A-Z])++", Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    
                for (String t: tests) {
                    Matcher m = p.matcher(t);
                    while (m.find()) {
                        System.out.print(m.group() + " ");
                    }
                    System.out.println();
                }
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
    
