Regular expression to detect numbers written as words - UTF-8 input


Problem description

Thanks for the replies to "regular expression to detect numbers written as words".

I now have this working. However, I have the same requirement where the numbers written as words are in Arabic (or any other UTF-8 text) rather than English, so:

if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0) 
   return true;

This does not work. I've googled, and there seem to be quite a few issues with preg_match and UTF-8 strings, but I couldn't get any of the suggestions I found to work. Any help is much appreciated.

Recommended answer

Note that \b may not be working as you expect. \b specifies a word boundary, but what PCRE considers a word character depends on the locale the script is running in (see towards the bottom of the PCRE escape sequences manual page):

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
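The effect can be checked with a minimal sketch like the one below. It assumes PHP with a PCRE build that has Unicode support, and the helper name `isWordChar` is hypothetical. With the `u` modifier PHP treats pattern and subject as UTF-8, and in current PHP versions `\w` then also becomes Unicode-aware; without it, matching runs byte by byte against the active character tables.

```php
<?php
// Hypothetical helper: does $char count as exactly one "word" character for PCRE?
function isWordChar(string $char, bool $utf8): bool
{
    // With /u the subject is read as UTF-8 code points; without it, as raw bytes.
    $pattern = $utf8 ? '/^\w$/u' : '/^\w$/';
    return preg_match($pattern, $char) === 1;
}

$alef = 'ا'; // ARABIC LETTER ALEF, two bytes in UTF-8

var_dump(isWordChar('a', false));   // ASCII letter in byte mode
var_dump(isWordChar($alef, true));  // a single code point in UTF-8 mode
var_dump(isWordChar($alef, false)); // two bytes, so "^\w$" cannot cover it
```

In byte mode the Arabic letter is seen as two separate bytes, which is why \w and \b give surprising results on UTF-8 input.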

You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).

Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter".

So all together it would look like:

/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/
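As a runnable sketch, the pattern could be wrapped as shown below. Two things here are assumptions rather than part of the answer: the `u` modifier, added so that PHP treats pattern and subject as UTF-8 and `\P{L}` sees whole code points, and the function name `hasFourNumberWords`, which is hypothetical. Note that `(?<=\P{L})` needs a preceding non-letter character, so a run that begins at the very first character of the string is not matched; the negative form `(?<!\p{L})` is a common alternative if that case matters.

```php
<?php
// Hypothetical wrapper around the pattern from the answer.
function hasFourNumberWords(string $str): bool
{
    // Arabic number words, copied from the question.
    $words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';

    // (?<=\P{L}) : the previous character exists and is not a Unicode letter.
    // /u         : assumption, treat pattern and subject as UTF-8.
    $pattern = '/(?<=\P{L})(?:(?:' . $words . ')\s*?){4}/u';

    return preg_match($pattern, $str) === 1;
}

// Four number words preceded by a non-letter (the space after the colon).
var_dump(hasFourNumberWords('رقم: واحد اثنان ثلاثة أربعة'));

// Only two number words, so the {4} quantifier cannot be satisfied.
var_dump(hasFourNumberWords('رقم: واحد اثنان'));
```

The pattern has no trailing boundary, so a number word followed directly by more letters still counts; appending `(?!\p{L})` after each word would be one way to tighten that.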
