Regular expression to detect numbers written as words - UTF-8 input
Question
Thanks for the answers to "regular expression to detect numbers written as words":
I now have this working. However, I have the same requirement where the number words are in Arabic (or any other UTF-8 text) rather than English, so:
if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0)
return true;
This does not work. I've googled, and there seem to be quite a few issues with preg_match and UTF-8 strings, but I couldn't get any of the suggestions I found to work. Any help much appreciated.
Recommended answer
Note that \b may not be working as you expect. \b specifies a word boundary, but what PCRE considers a word character depends on the locale the script is running in (see the bottom of the PCRE escape sequences manual page):
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.
You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).
Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter".
So all together it would look like:
/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/
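As a rough sketch of how this could be wired into preg_match, the snippet below wraps the idea in a small function. The function name hasNumberWordRun and its $count parameter are hypothetical and not part of the original post. It also makes two changes to the pattern above, both assumptions on my part rather than part of the answer: it adds the u modifier so that PCRE treats the pattern and subject as UTF-8, and it uses the negative forms (?<!\p{L}) and (?!\p{L}) on both sides of each word, so a run at the very start or end of the string can still match (a positive lookbehind such as (?<=\P{L}) needs a preceding character to exist).

```php
<?php
// Hypothetical helper (name and signature are illustrative only): returns
// true when $str contains $count consecutive Arabic number words.
function hasNumberWordRun(string $str, int $count = 4): bool
{
    // Same word list as in the pattern above.
    $words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';

    // (?<!\p{L}) and (?!\p{L}) stand in for \b: no Unicode letter may sit
    // immediately before or after each word. The trailing "u" modifier makes
    // PCRE interpret the pattern and the subject as UTF-8.
    $pattern = '/(?:(?<!\p{L})(?:' . $words . ')(?!\p{L})\s*){' . $count . '}/u';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $str) === 1;
}
```

With this sketch, a string made of four of the listed words separated by spaces should match, while a string with only three of them should not.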