Regex match Arabic keyword


Question

I have a simple regex that finds a word in a text:

var pattern = new RegExp("\\bsomething\\b", "gi");

This matches the word when it has spaces or punctuation around it.

So it matches:

I have something.

But it does not match:

I havesomething.

which is fine and exactly what I need.

But I have an issue with other languages, for example Arabic. If I have the regex:

var pattern = new RegExp("\\bرياضة\\b", "gi");

and the text:

رياضة أنا أحب رياضتي وأنا سعيد حقا هنا لها حبي 

The keyword I am looking for is at the end of the text.

But this doesn't work; it simply doesn't find it.

It works if I remove \b from the regex:

var pattern = new RegExp("رياضة", "gi");

But that is not what I want, because I don't want to find the keyword when it is part of another word, as in the English example above:

 I havesomething.

I know very little about regex, so I would appreciate any help making this work with both English and languages like Arabic.

Recommended answer

First we have to understand what \b means:

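\b is an anchor that matches at a position called a "word boundary". In JavaScript, a word boundary is defined in terms of \w, which covers only [A-Za-z0-9_], so Arabic letters count as non-word characters and \b never matches between an Arabic letter and a space.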
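To make the behavior of \b concrete, here is a minimal sketch that runs the English and Arabic patterns from the question. The variable names and the short Arabic test string are illustrative assumptions, not part of the original post.

```javascript
// \b is a position between a \w character ([A-Za-z0-9_]) and a non-\w character.
const english = /\bsomething\b/gi;
const arabic = /\bرياضة\b/gi;

// English: boundaries exist around "something" only when it stands alone.
const englishHit = "I have something.".match(english);
const englishMiss = "I havesomething.".match(english);

// Arabic: every character is non-\w, so no position qualifies as a boundary.
const arabicResult = " رياضة ".match(arabic);

console.log(englishHit);   // one match
console.log(englishMiss);  // null
console.log(arabicResult); // null
```

The Arabic pattern fails not because the keyword is missing, but because \b has nothing to anchor to.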

In your case, the word boundaries you are looking for are positions where the keyword is not adjacent to another Arabic letter.

To match only Arabic letters in a regex, we use Unicode escapes:

[\u0621-\u064A]+

Or we can simply use the Arabic letters directly:

[ء-ي]+

The character class above matches any run of Arabic letters. To build a word boundary from it, we can simply negate it on both sides:

[^ء-ي]ARABIC TEXT[^ء-ي]

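The pattern above means: the Arabic word must have a non-Arabic character on each side, which will work in your case. Note that it requires an actual character on each side, so it will not match a word at the very start or end of the string unless the text is padded with spaces, and the neighboring characters become part of the match.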
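As one possible variation on the negated-class idea, the sketch below uses lookbehind and lookahead assertions so that the neighboring characters are not consumed and a keyword at the start or end of the string still matches. This is not the answer's original code: the names WORD_CHARS, escapeRegExp, and wholeWordRegExp are hypothetical, adding \w to the class so English keywords behave as before is my assumption, and lookbehind needs an ES2018-capable engine.

```javascript
// Characters treated as "part of a word": ASCII \w plus Arabic letters U+0621-U+064A.
// (Including \w is an assumption so the same helper also covers English.)
const WORD_CHARS = "\\w\\u0621-\\u064A";

// Escape regex metacharacters so the keyword is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a regex that matches `word` only when no word character touches it.
function wholeWordRegExp(word) {
  return new RegExp(
    "(?<![" + WORD_CHARS + "])" + escapeRegExp(word) + "(?![" + WORD_CHARS + "])",
    "gi"
  );
}

// The keyword is the first word here, with no padding space before it.
const text = "رياض أنا أحب رياضتي رياضة رياضيات";
const matches = text.match(wholeWordRegExp("رياض")) || [];
console.log(matches.length);

console.log(wholeWordRegExp("something").test("I have something."));
console.log(wholeWordRegExp("something").test("I havesomething."));
```

Because the assertions are zero-width, a replace call built on this regex wraps only the keyword itself, not the spaces around it.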

Consider this example, based on the one you gave, which I modified slightly:

 أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا 

If we try to match only رياض, a plain search will also match inside رياضة, رياضيات, and رياضتي. However, with the pattern above, the match succeeds on رياض only.

var x = " أنا أحب رياضتي رياض رياضة رياضيات وأنا سعيد حقا هنا ";
x = x.replace(/([^ء-ي]رياض[^ء-ي])/g, '<span style="color:red">$1</span>');
document.write (x);

If you would like to account for the alef variants أآإا in a single pattern, you can use something like [\u0622\u0623\u0625\u0627] or simply list them all between square brackets: [أآإا]. Here is a complete example:

var x = "أنا هنا وانا هناك .. آنا هنا وإنا هناك";
x = x.replace(/([أآإا]نا)/g, '<span style="color:red">$1</span>');
document.write (x);

Note: If you want to match the common Arabic characters in a regex, including the Arabic letters أ ب ت ث ج, the diacritics َ ً ُ ٌ ِ ٍ ّ, and the Arabic-Indic digits ١٢٣٤٥٦٧٨٩٠, use this regex: [،-٩]+ (the range U+060C to U+0669).
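A small sketch of what that range covers, written with Unicode escapes so the endpoints are explicit. The names arabicRange and toCodePoint are hypothetical helpers introduced for illustration only.

```javascript
// Same range as [،-٩], spelled with escapes: U+060C (Arabic comma) to U+0669 (Arabic-Indic digit nine).
const arabicRange = /^[\u060C-\u0669]+$/;

// Format a character's code point as U+XXXX for inspection.
function toCodePoint(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// A letter (beh), a diacritic (fatha), and a digit (five) all fall inside the range.
const inside = arabicRange.test("\u0628\u064E\u0665");

// Latin letters fall outside it, and so do extended letters such as U+067E (peh),
// which sits above U+0669 in the Arabic block.
const latin = arabicRange.test("abc");
const extended = arabicRange.test("\u067E");

console.log(toCodePoint("\u0628"), inside, latin, extended);
```

If the text may contain letters used by Persian, Urdu, or other languages written in Arabic script, the range would need to be widened accordingly.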

Useful link about the ordering of Arabic characters in Unicode: https://en.wikipedia.org/wiki/Arabic_script_in_Unicode
