Regular expression to detect numbers written as words - UTF-8 input

Question

Thanks for the answers to "regular expression to detect numbers written as words".

I now have this working, but I have the same requirement for numbers written as words in Arabic (or any other language in UTF-8 text) rather than English, so:

if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0) 
   return true;

This does not work. I've googled, and there seem to be quite a few issues with preg_match and UTF-8 strings, but I couldn't get any of the suggestions I found to work. Any help is much appreciated.

Recommended answer

Note that \b may not be working as you expect. \b specifies a word boundary, but what PCRE considers a word character depends on the locale the script is running in (see towards the bottom of the PCRE escape sequences manual page):

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).
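To make the byte-versus-character issue concrete, here is a minimal sketch (not part of the original answer) that runs the same patterns with and without the u modifier. It assumes the script file itself is saved as UTF-8; the variable names are illustrative only.

```php
<?php
// Assumption: this file is saved as UTF-8, so 'ص' (U+0635) is the two bytes 0xD8 0xB5.
$char = 'ص';

// Without the u modifier PCRE works on bytes, so "." matches a single byte
// and the two-byte character cannot satisfy ^.$.
$byteMode = preg_match('/^.$/', $char);

// With the u modifier the pattern and subject are treated as UTF-8,
// so "." matches the whole character.
$utf8Mode = preg_match('/^.$/u', $char);

// In UTF-8 mode the character is also recognised as a Unicode letter.
$isLetter = preg_match('/^\p{L}$/u', $char);

printf(
    "bytes: %d, byte mode: %d, UTF-8 mode: %d, letter: %d\n",
    strlen($char),
    $byteMode,
    $utf8Mode,
    $isLetter
);
// prints "bytes: 2, byte mode: 0, UTF-8 mode: 1, letter: 1"
```

The same mismatch affects \b and \w: in byte mode they are evaluated against individual bytes of a multi-byte character, which is why the locale-dependent character tables matter.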

Instead, you could use a lookaround combined with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter". Note that this lookbehind requires a preceding character to exist, so it cannot match at the very start of the string; the negative form (?<!\p{L}) also succeeds there.

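So all together it would look like this. Note the trailing u modifier, which makes PCRE treat the pattern and subject as UTF-8; without it, \P{L} is tested against single bytes rather than whole characters:

/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/u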
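As a runnable sketch of this approach, the following wraps the check in a function. The names numberWords and hasFourNumberWords and the sample strings are made up for illustration. The pattern is a variation on the one above rather than a copy of it: it uses the u modifier, and it uses the negative lookarounds (?<!\p{L}) and (?!\p{L}) so that a number word at the very start or end of the string still counts while a number word glued to other letters does not. It assumes the file and the input are UTF-8.

```php
<?php
// Hypothetical helper: the Arabic number words from the question.
function numberWords(): array
{
    return ['واحد', 'اثنان', 'ثلاثة', 'أربعة', 'خمسة', 'ستة', 'سبعة', 'ثمانية', 'تسعة', 'صفر', 'عشرة'];
}

// Hypothetical helper: true if $str contains four number words in a row,
// optionally separated by whitespace. Assumes $str is valid UTF-8.
function hasFourNumberWords(string $str): bool
{
    $alternation = implode('|', array_map(
        function (string $word): string {
            return preg_quote($word, '/');
        },
        numberWords()
    ));

    // (?<!\p{L}) : not preceded by a letter (also true at the start of the string)
    // (?!\p{L})  : not followed by a letter (also true at the end of the string)
    // u modifier : treat pattern and subject as UTF-8
    $pattern = '/(?:(?<!\p{L})(?:' . $alternation . ')(?!\p{L})\s*){4}/u';

    return preg_match($pattern, $str) === 1;
}

// Build the samples from the same word list so they match the pattern byte for byte.
$four  = implode(' ', array_slice(numberWords(), 0, 4));
$three = implode(' ', array_slice(numberWords(), 0, 3));

var_dump(hasFourNumberWords($four));           // four number words in a row: true
var_dump(hasFourNumberWords("abc $four xyz")); // surrounded by other text: true
var_dump(hasFourNumberWords($three));          // only three number words: false
var_dump(hasFourNumberWords('س' . $four));     // first word glued to another letter: false
```

Building the alternation from an array keeps the right-to-left words out of the regex literal, which makes the pattern easier to read and edit in most editors.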
