Regex match Arabic keyword
Problem description
I have a simple regex that finds a word in a text:
var patern = new RegExp("\\bsomething\\b", "gi");
This matches the word when it has spaces or punctuation around it.
So it matches:
I have something.
But it does not match:
I havesomething.
That is fine and exactly what I need.
But I have an issue with, for example, Arabic. If I have the regex:
var patern = new RegExp("\\bرياضة\\b", "gi");
and the text:
رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي
The keyword I am looking for is at the end of the text.
But this doesn't work; it just doesn't find it.
It works if I remove \b from the regex:
var patern = new RegExp("رياضة", "gi");
But that is not what I want, because I don't want to find the keyword when it is part of another word, as in the English example above:
I havesomething.
I have very little knowledge of regex, so can anyone help me make this work with English and with languages like Arabic?
Recommended answer
First we have to understand what \b means:
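\b is an anchor that matches at a position called a "word boundary". In JavaScript, a word boundary is defined in terms of \w, which covers only the ASCII characters [A-Za-z0-9_], so Arabic letters never count as word characters and \b does not match between an Arabic letter and a space.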
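The behaviour of \b described here can be checked directly. The following is a minimal sketch; the variable names are only illustrative:

```javascript
// \b is defined via \w, and \w only covers ASCII letters, digits and underscore.
const englishFound = /\bsomething\b/i.test("I have something.");

// The same pattern shape with an Arabic keyword finds nothing, because the
// characters on both sides of every position are "non-word" characters.
const arabicFound = /\bرياضة\b/.test("أنا أحب رياضة");

// An Arabic letter on its own is not matched by \w.
const arabicLetterIsWordChar = /\w/.test("ر");

console.log(englishFound, arabicFound, arabicLetterIsWordChar);
```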
In your case, the word boundaries you are looking for are positions with no other Arabic letter next to the word.
To match only Arabic letters in a regex, we use Unicode escapes:
[\u0621-\u064A]+
Or we can simply use the Arabic letters directly:
[ء-ي]+
The pattern above matches any run of Arabic letters. To make a word boundary out of it, we can negate the character class and place it on both sides of the word:
[^ء-ي]ARABIC TEXT[^ء-ي]
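The pattern above means: there must be no Arabic letter on either side of the Arabic word, which will work in your case. Note that each negated class consumes one character, so the word needs a non-Arabic character on both sides; this is why the example further down pads the string with spaces.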
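If padding the text is not an option, one possible variation keeps the same character class but lets the start and end of the string act as boundaries too. This is only a sketch: the names `notArabicLetter`, `keyword`, `standalone` and `sample` are illustrative and not part of the original answer.

```javascript
// Illustrative names; the class is the answer's [ء-ي], written with \u escapes.
const notArabicLetter = "[^\\u0621-\\u064A]";
const keyword = "رياض";

// Group 1 keeps the character before the keyword (or the start of the string);
// the lookahead checks the character after it without consuming it.
const standalone = new RegExp(
  `(^|${notArabicLetter})(${keyword})(?=${notArabicLetter}|$)`,
  "g"
);

const sample = "رياض رياضة رياضيات";
// Only group 2 (the keyword itself) is wrapped; group 1 is put back unchanged.
const highlightedSample = sample.replace(
  standalone,
  '$1<span style="color:red">$2</span>'
);
console.log(highlightedSample);
```

Because the preceding character is captured separately and the following one is only looked at, the span wraps the keyword alone rather than the keyword plus its neighbours.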
Consider this example, which is the one you gave us with a few modifications:
أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا
If we search for رياض on its own, the search also matches inside رياضة, رياضيات, and رياضتي. However, if we add the pattern above, the match succeeds on رياض only.
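// The string is padded with spaces so the keyword always has a non-Arabic character on both sides.
var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
// Wrap each standalone occurrence of the keyword (with its neighbouring characters) in a red span.
x = x.replace(/([^ء-ي]رياض[^ء-ي])/g, '<span style="color:red">$1</span>');
document.write(x);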
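For a single pattern that handles both English and Arabic, a more general option is to use lookarounds with Unicode property escapes instead of a hand-written letter range. This swaps in a different technique from the answer's, and it assumes an engine with ES2018 support for lookbehind and \p{...} (current browsers and Node.js). The helper name `wholeWordRegex` is hypothetical. Unlike \b, this sketch does not treat digits or underscores as part of a word.

```javascript
// Hypothetical helper: builds a regex that matches `word` only when it is not
// directly attached to another letter or combining mark on either side.
function wholeWordRegex(word) {
  // Escape regex metacharacters so the keyword is treated literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // (?<!...) and (?!...) assert without consuming characters, so the match
  // also works at the very start or end of the string.
  return new RegExp(`(?<![\\p{L}\\p{M}])${escaped}(?![\\p{L}\\p{M}])`, "giu");
}

const text = "أنا أحب رياضتي رياض رياضة رياضيات";
const arabicMatches = text.match(wholeWordRegex("رياض"));
const englishHit = wholeWordRegex("something").test("I have something.");
const englishMiss = wholeWordRegex("something").test("I havesomething.");

console.log(arabicMatches, englishHit, englishMiss);
```

Including \p{M} in the lookarounds means a keyword followed by an Arabic diacritic is also treated as part of a longer word, which may or may not be what you want.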
If you would like to account for أآإا with one pattern, you could use something like [\u0622\u0623\u0625\u0627] or simply list them all between square brackets: [أآإا]. Here is the complete code:
var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
// Highlight the word whichever of the four alef forms it starts with.
x = x.replace(/([أآإا]نا)/g, '<span style="color:red">$1</span>');
document.write(x);
Note: If you want to match the full range of common Arabic characters in a regex, including all Arabic letters أ ب ت ث ج, all diacritics َ ً ُ ٌ ِ ٍ ّ, and all Arabic numerals ١٢٣٤٥٦٧٨٩٠, use this regex: [،-٩]+
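As a rough check of that range, the sketch below assumes that [،-٩] runs from U+060C (the Arabic comma) to U+0669 (the Arabic-Indic digit nine) and writes it with \u escapes so that the code points are explicit. The variable names are illustrative.

```javascript
// Assumption: [،-٩] is taken to span U+060C (Arabic comma) to U+0669
// (Arabic-Indic digit nine); the escapes make the range unambiguous.
const arabicRun = /^[\u060C-\u0669]+$/;

const lettersOk = arabicRun.test("\u0631\u064A\u0627\u0636\u0629"); // letters of the keyword رياضة
const diacriticsOk = arabicRun.test("\u0645\u064E");               // meem followed by a fatha
const digitsOk = arabicRun.test("\u0661\u0662\u0663");             // Arabic-Indic digits 1, 2, 3
const latinRejected = !arabicRun.test("abc");                      // Latin letters are outside the range

console.log(lettersOk, diacriticsOk, digitsOk, latinRejected);
```

Characters outside this span, such as the extended letters used for Persian or Urdu, are not covered by it.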
Useful link about the ordering of Arabic characters in Unicode: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode