采用UTF-8编码的多语言输入验证 [英] Multi-language input validation with UTF-8 encoding

查看:124
本文介绍了采用UTF-8编码的多语言输入验证的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

要检查用户输入的英文名称是否有效,我通常将输入与正则表达式(例如[A-Za-z])进行匹配.但是如果utf8编码需要多语言(如中文,日语等)支持,该怎么办?

To check a user input english name is valid, I would usually match the input against regular expression such as [A-Za-z]. But how can I do this if multi-language(like Chinese, Japanese etc.) support is required with utf8 encoding?

推荐答案

如果您的语言不支持适当的 Alphabetic ,则可以使用[\pL\pM\p{Nl}]简洁地近似Unicode派生的属性\p{Alphabetic}.财产直接.

You can approximate the Unicode derived property \p{Alphabetic} pretty succintly with [\pL\pM\p{Nl}] if your language doensn’t support a proper Alphabetic property directly.

请勿使用Java的\p{Alpha},因为

Don’t use Java’s \p{Alpha}, because that’s ASCII-only.

但是随后您会注意到您没有考虑破折号(\p{Pd} DashPunctuation 有效,但是包括了大多数的连字符! ),单引号(通常但不总是U + 27,U + 2BC,U + 2019或U + FF07之一),逗号或句号/句号.

But then you’ll notice that you’ve failed to account for dashes (\p{Pd} or DashPunctuation works, but that does not include most of the hyphens!), apostrophes (usually but not always one of U+27, U+2BC, U+2019, or U+FF07), comma, or full stop/period.

为防万一,您最好添加\p{Pc} ConnectorPunctuation .

You probably had better include \p{Pc} ConnectorPunctuation, just in case.

如果您具有Unicode派生的属性\p{Diacritic},则也应该使用它,因为它包括加泰罗尼亚语中加双L字母所需的中间点和人们有时使用的非组合变音符号形式.

If you have the Unicode derived property \p{Diacritic}, you should use that, too, because it includes things like the mid-dot needed for geminated L’s in Catalan and the non-combining forms of diacritic marks which people sometimes use.

但是,随后您会发现以\p{Nl}( LetterNumber )不适应的方式使用其名字使用序数的人,因此您抛出了\p{Nd}( DecimalNumber )或什至全部\pN( Number )混在一起.

But then you’ll find people who use ordinal numbers in their names in ways that \p{Nl} (LetterNumber) doesn’t accomodate, so you throw \p{Nd} (DecimalNumber) or even all of \pN (Number) into the mix.

然后您意识到亚洲名称经常需要使用ZWJ或ZWNJ才能在其脚本中正确编写,因此必须将U + 200D和U + 200C都添加到组合中,它们都为\p{Cf}( Format )字符,甚至还有 JoinControl 字符.

Then you realize that Asian names often require the use of ZWJ or ZWNJ to be written correctly in their scripts, so then you have to add U+200D and U+200C to the mix, which are both \p{Cf} (Format) characters and indeed also JoinControl ones.

等到您完成查找各种Unicode属性以了解各种许多异国情调的字符不断涌现-或者当您认为完成后,相反,您几乎可以肯定得出结论:如果您简单地允许他们使用这些字符,您会做得更好他们希望使用的名称的任何Unicode字符,例如 Tim引用的链接建议.是的,您会得到一些小丑的支持,例如əɯɐuʇƨɐ⅂əɯɐuʇƨɹᴉℲ"之类的事物,但这仅与领土有关,并且您不能以任何合理的方式排除愚蠢的名字.

By the time you’re done looking up the various Unicode properties for the various and many exotic characters that keep cropping up — or when you think you’re done, rather — you’re almost certain to conclude that you would do a much better job at this if you simply allowed them to use whatever Unicode characters for their name that they wish, as the link Tim cites advises. Yes, you’ll get a few jokers putting in things like "əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ", but that just goes with the territory, and you can’t preclude silly names in any reasonable way.

这篇关于采用UTF-8编码的多语言输入验证的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆