Javascript Regex + Unicode Diacritic Combining Characters


Problem description

I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:

'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.

I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/

Is there a shorter, and thus better, way to do this, or is regex matching of Unicode combining diacritics simply not this easy in JavaScript? Thank you

Solution

Usually this is made by combining an 'é' with a '\u0323' under dot diacritic

However, that isn't what you have here:

'ẹ́'

that's not U+00E9, U+0323 but U+1EB9, U+0301: an 'ẹ' (e with dot below) combined with an acute diacritic.
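To see the difference for yourself, it can help to print the code points of each string. The sketch below assumes the 'é' in the question is the precomposed U+00E9; `codePoints` is a made-up helper name, not a built-in.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

// Assumption: 'é' here is the precomposed U+00E9, followed by combining dot below.
const typed = '\u00E9\u0323';
// 'ẹ' (U+1EB9) followed by combining acute accent.
const pasted = '\u1EB9\u0301';

console.log(codePoints(typed));  // [ 'U+00E9', 'U+0323' ]
console.log(codePoints(pasted)); // [ 'U+1EB9', 'U+0301' ]

// Both strings render the same, but only the first matches the pattern.
console.log(/[\u00E9]\u0323/.test(typed));  // true
console.log(/[\u00E9]\u0323/.test(pasted)); // false
```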

The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
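As a rough sketch of that idea, assuming an engine that provides `String.prototype.normalize` (ES2015 or later): normalise the input to NFC and write the pattern in NFC as well. `matchesNfc` is an illustrative name, not a library function.

```javascript
// Hypothetical helper: normalise the input to NFC, then test it against a
// pattern that is itself written in NFC form.
function matchesNfc(input, pattern) {
  return pattern.test(input.normalize('NFC'));
}

// In NFC, "e + dot below + acute" is always U+1EB9 U+0301,
// whichever way the marks were originally typed.
const eDotAcute = /\u1EB9\u0301/;

console.log(matchesNfc('\u00E9\u0323', eDotAcute));  // é + dot below      -> true
console.log(matchesNfc('\u1EB9\u0301', eDotAcute));  // ẹ + acute          -> true
console.log(matchesNfc('e\u0301\u0323', eDotAcute)); // e + acute + dot    -> true
console.log(matchesNfc('\u00E9', eDotAcute));        // plain é            -> false
```

The point of normalising first is that the pattern only has to list one spelling of each letter instead of every possible ordering of its marks.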

I don't just want to match e. I want to match all combinations

Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
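A minimal sketch of that approach, again assuming `String.prototype.normalize` is available. It only strips marks in the Combining Diacritical Marks block (U+0300 to U+036F), which covers the accents and under-marks in the question but not every combining character in Unicode. `stripDiacritics` is a hypothetical name.

```javascript
// Hypothetical helper: decompose to NFD, then drop combining marks in the
// U+0300..U+036F block, leaving only the base letters.
function stripDiacritics(str) {
  return str.normalize('NFD').replace(/[\u0300-\u036F]/g, '');
}

console.log(stripDiacritics('\u1EB9\u0301')); // ẹ́ -> 'e'
console.log(stripDiacritics('\u00E9\u0323')); // é + dot below -> 'e'
console.log(stripDiacritics('\u1E62'));       // Ṣ -> 'S'
console.log(stripDiacritics('E\u0329'));      // E̩ -> 'E'
```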

Unfortunately, normalisation was not available natively in JS before ES2015 added String.prototype.normalize, so in older environments you would have to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. For picking up characters based on Unicode properties like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories, or, in ES2018 and later, native property escapes such as \p{M} with the u flag.
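In engines that do implement `String.prototype.normalize` (ES2015) and Unicode property escapes (ES2018), the enumerated character class from the question could be replaced by something like the sketch below. The letter set `[aeious]` and the helper name `findMarkedLetters` are assumptions for illustration; adjust the set to the letters you actually care about.

```javascript
// Hypothetical helper: find every base letter from a small set that carries
// at least one combining mark, regardless of how the marks were encoded.
function findMarkedLetters(text) {
  // After NFD each letter is a base character followed by its marks,
  // so "base letter + one or more \p{M}" covers every combination.
  const matches = text.normalize('NFD').match(/[aeious]\p{M}+/giu) || [];
  // Recompose to NFC so the results are in a predictable form.
  return matches.map((m) => m.normalize('NFC'));
}

// Input is "ẹ́ b ò s": only the letters that carry marks are returned.
const found = findMarkedLetters('\u1EB9\u0301 b \u00F2 s');
console.log(found.length); // 2
console.log(found[0] === '\u1EB9\u0301', found[1] === '\u00F2'); // true true
```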

Server-side languages (eg Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server that would generally be easier.
