Regular expression to detect numbers written as words - UTF-8 input

Views: 114 | Published: 2020/7/13 4:53:26 | Tags: php, regex, utf-8, preg-match, arabic


Question

Thanks for the replies to "regular expression to detect numbers written as words".

I now have that working. However, I have the same requirement for number words written in Arabic (or any other language in UTF-8 text) rather than English, so:

if (preg_match("/\p{L}\b(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\b\s*?){4}/", $str, $matches) > 0) 
   return true;

This does not work. I've searched and there seem to be quite a few issues with preg_match and UTF-8 strings, but I couldn't get any of the suggestions I found to work. Any help is much appreciated.

Recommended answer

Note that \b may not work as you expect. \b specifies a word boundary, but what PCRE considers a word character depends on the locale the script is running in (see towards the bottom of the PCRE escape sequences manual page):

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

You might also want to read Handling UTF-8 with PHP (the section on PCRE in particular).
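To check how \w treats Arabic text on a given setup, a small probe like the one below may help. It is only a sketch: the variable names are illustrative, and it assumes PHP's PCRE was built with Unicode support. The u modifier tells PCRE to treat the pattern and subject as UTF-8; without it the subject is matched byte by byte, so a multi-byte Arabic letter is not seen as a single word character.

```php
<?php
// Hypothetical probe: is an Arabic word made of "word" characters?
$word = 'واحد'; // "one"

// Without the u modifier the subject is matched as raw bytes.
$byteMode = preg_match('/^\w+$/', $word);

// With the u modifier the pattern and subject are treated as UTF-8.
$utf8Mode = preg_match('/^\w+$/u', $word);

// Compare the two results on your own installation.
var_dump($byteMode, $utf8Mode);
```

If the two results differ, the pattern is being applied to bytes rather than characters whenever the u modifier is missing.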

Instead, you could use a lookaround in conjunction with a Unicode character property to emulate a word boundary: (?<=\P{L}). This asserts that the previous character is not a Unicode "letter".
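A short sketch of how that lookbehind behaves, using a single number word. The test strings and variable names are made up for illustration. One thing worth noticing is that (?<=\P{L}) requires a previous character to exist, so it cannot match at the very start of the string; a negative lookbehind, (?<!\p{L}), is a possible alternative that also accepts the start.

```php
<?php
// Emulated word boundary: the character before the match must be a non-letter.
$pattern = '/(?<=\P{L})واحد/u';

// Preceded by a space (a non-letter), so the lookbehind succeeds.
$afterSpace = preg_match($pattern, 'رقم واحد');

// At the start of the string there is no previous character at all,
// so the positive lookbehind cannot succeed.
$atStart = preg_match($pattern, 'واحد');

// Negative lookbehind: "not preceded by a letter" also holds at the start.
$atStartNegative = preg_match('/(?<!\p{L})واحد/u', 'واحد');

var_dump($afterSpace, $atStart, $atStartNegative);
```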

So all together it would look like this (the trailing u modifier makes PCRE treat the pattern and subject as UTF-8):

/(?<=\P{L})(?:(?:واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة)\s*?){4}/u
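One way this could be wrapped up for use is sketched below. The function name hasFourNumberWords is hypothetical, and the pattern is a variant rather than the exact one above: it uses the u modifier, uses a negative lookbehind so a match at the start of the string is allowed, and uses a negative lookahead after each word to stand in for the trailing \b from the question. Treat these as assumptions to adapt, not as a definitive fix.

```php
<?php
/**
 * Hypothetical helper: true if $str contains four consecutive Arabic
 * number words (zero to ten), separated by optional whitespace.
 */
function hasFourNumberWords(string $str): bool
{
    $words = 'واحد|اثنان|ثلاثة|أربعة|خمسة|ستة|سبعة|ثمانية|تسعة|صفر|عشرة';

    // (?<!\p{L}) : the run must not be preceded by a letter.
    // (?!\p{L})  : each word must not be followed by a letter.
    $pattern = '/(?<!\p{L})(?:(?:' . $words . ')(?!\p{L})\s*){4}/u';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $str) === 1;
}

var_dump(hasFourNumberWords('واحد اثنان ثلاثة أربعة')); // four number words
var_dump(hasFourNumberWords('واحد اثنان ثلاثة'));       // only three
```

Comparing against 1 with === means an error result (false, for example on malformed UTF-8 input) is reported as "no match" instead of being mistaken for a hit.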
