Concrete JavaScript Regex for Accented Characters (Diacritics)


Problem description

I've looked on Stack Overflow (replacing characters... eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question:

How can JavaScript match accented characters (those with diacritics)?

I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently in JavaScript it's a bit more difficult than in other languages/platforms.

This was my original version, until I wanted to add diacritic support:

/^[a-zA-Z]+,\s[a-zA-Z]+$/

Currently I'm debating between three methods to add support, all of which I have tested and which work (at least to some extent; I don't really know what the "extent" of the second approach is). Here they are:

var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
// Build the full regex
var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
// Create a RegExp from the string version
var regexCompiled = new RegExp(regex);
// regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/




  • This correctly matches a last/first name with any of the supported accented characters in accentedCharacters.

var regex = /^.+,\s.+$/;

  • This would match just about anything, at least in the form of: something, something. That's alright I suppose...

/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/

  • It matches a range of Unicode characters. Tested and working, although I haven't tried anything crazy, just the normal stuff I see in our language department for faculty member names.

Here are my concerns:


1. The first solution is far too limiting, and sloppy and convoluted at that. It would need to be changed if I forgot a character or two, and that's just not very practical.
2. The second solution is better and more concise, but it probably matches far more than it should. I couldn't find any real documentation on exactly what . matches, just the generalization of "any character except the newline character" (from a table on the MDN).
3. The third solution seems to be the most precise, but are there any gotchas? I'm not very familiar with Unicode, at least in practice, but looking at a code table and the continuation of that table (http://www.utf8-chartable.de/unicode-utf8-table.pl?start=256), \u00C0-\u017F seems to be pretty solid, at least for my expected input.

  • Faculty won't be submitting forms with their names in their native scripts (e.g. Arabic, Chinese, Japanese), so I don't have to worry about characters outside the Latin character set.
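On the "gotchas" question for the \u00C0-\u017F range, two are easy to check directly. The sketch below is illustrative only: the sample names and the helper isValidName are assumptions, not part of the original question. It shows that the range also contains the symbols × (U+00D7) and ÷ (U+00F7), and that input in decomposed form (a base letter followed by a combining mark) falls outside the range unless it is normalized first.

```javascript
// The Unicode-range pattern from the third approach.
const rangeRegex = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Precomposed (NFC) input: n-tilde is U+00F1, e-acute is U+00E9, both inside the range.
const composed = "Mu\u00F1oz, Jos\u00E9";
const composedOk = rangeRegex.test(composed);

// Gotcha 1: the multiplication sign (U+00D7) and division sign (U+00F7)
// sit inside \u00C0-\u017F, so they are accepted as if they were letters.
const symbolsOk = rangeRegex.test("a\u00D7b, c\u00F7d");

// Gotcha 2: decomposed (NFD) input stores accents as combining marks
// (U+0300 and up), which are outside the range.
const decomposed = composed.normalize("NFD");
const decomposedOk = rangeRegex.test(decomposed);

// Hypothetical helper: normalize to NFC before testing.
function isValidName(input) {
  return rangeRegex.test(input.normalize("NFC"));
}

console.log(composedOk, symbolsOk, decomposedOk, isValidName(decomposed));
```

Whether either gotcha matters depends on where the input comes from; text typed into a browser form is usually precomposed, but pasted text may not be.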






So the real question(s): Which of these three approaches is most suited for the task? Or are there better solutions?

Recommended answer


Which of these three approaches is most suited for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They could be extended to all non-whitespace characters, which can be done using the \S character class.
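As a rough illustration of the \S suggestion, the sketch below swaps the character ranges for \S. The sample inputs are assumptions chosen to show what this accepts and rejects. The second pattern is not from the answer: it is an alternative that assumes an engine with ES2018 Unicode property escapes (\p{L} for letters, \p{M} for combining marks, with the u flag).

```javascript
// \S-based variant of the pattern: any run of non-whitespace on each side.
const nonSpaceRegex = /^\S+,\s\S+$/;

// Assumption: ES2018 property escapes are available (not part of the answer).
const letterRegex = /^[\p{L}\p{M}]+,\s[\p{L}\p{M}]+$/u;

const samples = [
  "M\u00FCller, J\u00FCrgen",  // Latin-1 accents
  "Nguy\u1EC5n, Th\u1ECB",     // letters beyond \u017F
  "abc123, !!!",               // digits and punctuation
  "van der Berg, Jan",         // multi-word last name
];

// Collect the result of each pattern for each sample.
const results = samples.map((s) => ({
  input: s,
  nonSpace: nonSpaceRegex.test(s),
  letters: letterRegex.test(s),
}));

console.log(results);
```

Both patterns reject names containing spaces, and \S additionally lets digits, punctuation and even extra commas through.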


I'm forcing a field in a UI to match the format: last_name, first_name (last [comma space] first)

The most basic problem I'm seeing here is not diacritics but whitespace. Some names consist of multiple words, e.g. because of titles. So you should go with the most generic pattern, one that allows everything but the comma that separates the last name from the first:

            /[^,]+,\s[^,]+/
            

But your second solution with the . character class is just as fine; you might only need to take care of multiple commas then.
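To make the multiple-comma point concrete, here is a small comparison. The anchored variant (with ^ and $ added) is an assumption for illustration; the answer's pattern is unanchored as written, and the sample names are made up.

```javascript
// The answer's pattern as written (no anchors): matches anywhere in the string.
const unanchored = /[^,]+,\s[^,]+/;

// Assumption: the same pattern with ^ and $ added so the whole field must match.
const anchored = /^[^,]+,\s[^,]+$/;

// The question's second approach, using the . character class.
const dotRegex = /^.+,\s.+$/;

// Multi-word names are accepted because [^,] and . both match spaces.
const multiWord = "de la Cruz, Mar\u00EDa Jos\u00E9";
const multiWordOk = anchored.test(multiWord) && dotRegex.test(multiWord);

// A second comma: . matches commas, so the dot version accepts this input.
const twoCommas = "Smith, John, Jr.";
const dotAcceptsTwoCommas = dotRegex.test(twoCommas);

// The unanchored pattern also accepts it, via the substring "Smith, John".
const unanchoredAcceptsTwoCommas = unanchored.test(twoCommas);

// The anchored [^,] version rejects it.
const anchoredAcceptsTwoCommas = anchored.test(twoCommas);

console.log(multiWordOk, dotAcceptsTwoCommas, unanchoredAcceptsTwoCommas, anchoredAcceptsTwoCommas);
```

If suffixes such as "Jr." after a second comma are legitimate in your data, the dot version may be the better fit; otherwise the anchored [^,] version enforces exactly one comma.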
