Text matching not working for Arabic; the issue may be due to the regex for Arabic

Problem description

I have been working on adding a feature to my multilingual website that highlights matching tag keywords in the article text.

This works in the English version but does not work in the Arabic version.

I have set up a sample on jsFiddle.

Sample Code

    function HighlightKeywords(keywords)
    {        
        var el = $("#article-detail-desc");
        var language = "ar-AE";
        var pid = 32;
        var issueID = 18; 
        $(keywords).each(function()
        {
           // var pattern = new RegExp("("+this+")", ["gi"]); //breaks html
            var pattern = new RegExp("(\\b"+this+"\\b)(?![^<]*?>)", "gi"); //looks for match outside html tags
            var rs = "<a class='ad-keyword-selected' href='http://www.alshindagah.com/ar/search.aspx?Language="+language+"&PageId="+pid+"&issue="+issueID+"&search=$1' title='Search website for: $1'><span style='color:#990044; text-decoration:none;'>$1</span></a>";
            el.html(el.html().replace(pattern, rs));
        });
    }   

HighlightKeywords(["you","الهدف","طهران","سيما","حاليا","Hello","34","english"]);

    //Popup Tooltip for article keywords
    $(function() {
        $("#article-detail-desc").tooltip({
            position: {
                my: "center bottom-20",
                at: "center top",
                using: function( position, feedback ) {
                    $( this ).css( position );
                    $( "<div>" )
                        .addClass( "arrow" )
                        .addClass( feedback.vertical )
                        .addClass( feedback.horizontal )
                        .appendTo( this );
                }
            }
        });
    });

I store the keywords in an array and then match them against the text in a particular div.

I am not sure whether the problem is due to Unicode or something else. Any help is appreciated.

Solution

There are three sections to this answer:

  1. Why it's not working

  2. An example of how you could approach it in English (meant to be adapted to Arabic by someone with a clue about Arabic)

  3. A stab at doing the Arabic version by someone (me) who hasn't a clue about Arabic :-)

Why it's not working

At least part of the problem is that you're relying on the \b assertion, which (like its counterparts \B, \w, and \W) is English-centric. You can't rely on it in other languages (or even, really, in English — see below).

Here's the definition of \b in the spec:

The production Assertion :: \ b evaluates by returning an internal AssertionTester closure that takes a State argument x and performs the following:

  • Let e be x's endIndex.
  • Call IsWordChar(e–1) and let a be the Boolean result.
  • Call IsWordChar(e) and let b be the Boolean result.
  • If a is true and b is false, return true.
  • If a is false and b is true, return true.
  • Return false.

...where IsWordChar is defined further down as basically meaning one of these 63 characters:

a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r  s  t  u  v  w  x  y  z
A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z
0  1  2  3  4  5  6  7  8  9  _    

That is, the 26 English letters a to z in upper or lower case, the digits 0 to 9, and _. (This means you can't even rely on \b, \B, \w, or \W in English, because English has loan words like "Voilà", but that's another story.)
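To make the effect concrete, here is a minimal sketch of how that definition plays out. The isWordChar helper is my own rough stand-in for the spec's IsWordChar, not something JavaScript exposes, and the sample sentences are made up for illustration:

```javascript
// Rough stand-in for the spec's IsWordChar test on a single character.
function isWordChar(ch) {
    return /^[A-Za-z0-9_]$/.test(ch);
}

var latinIsWord = isWordChar("a");        // true
var arabicIsWord = isWordChar("ط");       // false: not one of the 63 characters
var accentIsWord = isWordChar("\u00e0");  // false: "à" is not one of them either

// \b needs a word character on exactly one side of the position.
// A space followed by an Arabic letter is non-word on both sides,
// so \b never matches there and the whole pattern fails.
var englishMatch = /\byou\b/.test("are you there");      // true
var arabicMatch = /\bطهران\b/.test("في طهران اليوم");     // false

// The same rule misfires in the other direction for accented Latin text:
// \b sees a boundary between "l" and "à", in the middle of the word.
var loanWordMatch = /\bVoil\b/.test("Voil\u00e0");       // true
```

So the pattern built in the question can only succeed for an Arabic keyword when it happens to sit directly next to an ASCII letter, digit, or underscore.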

A first example using English

You'll have to use a different mechanism for detecting word boundaries in Arabic. If you can come up with a character class that includes all of the Arabic "code points" (as Unicode puts it) that make up words, you could use code a bit like this:

var keywords = {
    "laboris": true,
    "laborum": true,
    "pariatur": true
    // ...and so on...
};
var text = "quis nostrud exercitation ullamco laboris nisi ut aliquip"; // sample text, substitute the text you want to work on
text = text.replace(
    /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
    replacer);

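function replacer(m, c0, c1) {
    // Own-property check, so words like "constructor" aren't treated as keywords
    if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
        c0 = '<a href="#">' + c0 + '</a>';
    }
    // c1 is undefined when nothing follows the last word
    return c0 + (c1 || "");
}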

Notes on that:

  • I've used the class [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "a word character". Obviously you'd have to change this (markedly) for Arabic.
  • I've used the class [^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_] to mean "not a word character". This is just the same as the previous class with the negation (^) at the outset.
  • The regular expression finds any series of "word characters" followed by an optional series of non-word characters, using capture groups ((...)) for both.
  • String#replace calls the replacer function with the full text that matched followed by each capture group as arguments.
  • The replacer function looks up the first capture group (the word) in the keywords map to see if it's a keyword. If so, it wraps it in an anchor.
  • The replacer function returns that possibly-wrapped word plus the non-word text that followed it.
  • String#replace uses the return value from replacer to replace the matched text.
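As a small illustration of the argument-passing described in these notes, the sketch below records what String#replace hands to the callback. The sample string and the uppercase transformation are mine, chosen only to make the result easy to see:

```javascript
var calls = [];
var result = "foo, bar".replace(/([a-z]+)([^a-z]+)?/g, function (m, c0, c1) {
    // m is the full match, c0 the word, c1 the trailing non-word text
    calls.push([m, c0, c1]);
    // An optional group that did not take part in the match arrives as
    // undefined, so guard it before concatenating.
    return c0.toUpperCase() + (c1 || "");
});

// calls  -> [["foo, ", "foo", ", "], ["bar", "bar", undefined]]
// result -> "FOO, BAR"
```

The second call is the one to watch: because "bar" is the last thing in the string, the second capture group is undefined, and concatenating it without a guard would append the literal text "undefined".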

Here's a full example of doing that:

<!DOCTYPE html>
<html>
<head>
<meta charset=utf-8 />
<title>Replacing Keywords</title>
</head>
<body>
  <p>Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</p>

  <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
  <script>
    (function() {
      // Our keywords. There are lots of ways you can produce
      // this map, here I've just done it literally
      var keywords = {
        "laboris": true,
        "laborum": true,
        "pariatur": true
      };

      // Loop through all our paragraphs (okay, so we only have one)
      $("p").each(function() {
        var $this, text;

        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);

        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();

        // Do the replacements
        // These character classes match JavaScript's
        // definition of a "word" character and so are
        // English-centric, obviously you'd change that
        text = text.replace(
          /([abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_]+)?/g,
          replacer);

        // Update the paragraph
        $this.html(text);
      });

      // Our replacer. We define it separately rather than
      // inline because we use it more than once      
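      function replacer(m, c0, c1) {
        // Is the word in our keywords map? (Own-property check, so
        // words like "constructor" aren't treated as keywords)
        if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
          // Yes, wrap it
          c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when nothing follows the last word
        return c0 + (c1 || "");
      }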
    })();
  </script>
</body>
</html>
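The comment in the example mentions that a real-world solution might need to loop through the text nodes so existing markup survives. Here is one possible sketch of that, not a tested production implementation. The names splitByKeywords and highlightTextNodes are hypothetical; the first is a pure function, the second assumes a browser DOM and uses only standard calls (createTreeWalker, createDocumentFragment, replaceChild):

```javascript
// Split plain text into segments and flag the ones that are keywords.
function splitByKeywords(text, keywords) {
    var re = /([A-Za-z0-9_]+)|[^A-Za-z0-9_]+/g;
    var segments = [];
    var match;
    while ((match = re.exec(text)) !== null) {
        var word = match[1]; // undefined for a run of non-word characters
        segments.push({
            text: match[0],
            keyword: word !== undefined &&
                Object.prototype.hasOwnProperty.call(keywords, word)
        });
    }
    return segments;
}

// Browser-only: wrap keywords found in the text nodes under `root`,
// leaving existing elements and their attributes untouched.
function highlightTextNodes(root, keywords) {
    var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
    var nodes = [];
    // Collect first, so replacing nodes doesn't disturb the walk
    while (walker.nextNode()) {
        nodes.push(walker.currentNode);
    }
    nodes.forEach(function (node) {
        var fragment = document.createDocumentFragment();
        splitByKeywords(node.nodeValue, keywords).forEach(function (seg) {
            var textNode = document.createTextNode(seg.text);
            if (seg.keyword) {
                var a = document.createElement("a");
                a.href = "#";
                a.appendChild(textNode);
                fragment.appendChild(a);
            } else {
                fragment.appendChild(textNode);
            }
        });
        node.parentNode.replaceChild(fragment, node);
    });
}
```

In a page you would call something like highlightTextNodes(document.getElementById("article-detail-desc"), keywords). Because the text is inserted with createTextNode rather than as an HTML string, characters such as < in the article text are not reinterpreted as markup.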

A stab at doing it with Arabic

I took a stab at the Arabic version. According to the Arabic script in Unicode page on Wikipedia, there are several code ranges used, but all of the text in your example fell into the primary range of U+0600 to U+06FF.

Here's what I came up with: Fiddle (I prefer JSBin, what I used above, but I couldn't get the text to come out the right way around.)

(function() {
    // Our keywords. There are lots of ways you can produce
    // this map, here I've just done it literally
    var keywords = {
        "الهدف": true,
        "طهران": true,
        "سيما": true,
        "حاليا": true
    };

    // Loop through all our paragraphs (okay, so we only have two)
    $("p").each(function() {
        var $this, text;

        // We'll use jQuery on `this` more than once,
        // so grab the wrapper
        $this = $(this);

        // Get the text of the paragraph
        // Note that this strips off HTML tags, a
        // real-world solution might need to loop
        // through the text nodes rather than act
        // on the full text all at once
        text = $this.text();

        // Do the replacements
        // These character classes just use the primary
        // Arabic range of U+0600 to U+06FF, you may
        // need to add others.
        text = text.replace(
            /([\u0600-\u06ff]+)([^\u0600-\u06ff]+)?/g,
            replacer);

        // Update the paragraph
        $this.html(text);
    });

    // Our replacer. We define it separately rather than
    // inline because we use it more than once      
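    function replacer(m, c0, c1) {
        // Is the word in our keywords map? (Own-property check, so
        // inherited properties are never treated as keywords)
        if (Object.prototype.hasOwnProperty.call(keywords, c0)) {
            // Yes, wrap it
            c0 = '<a href="#">' + c0 + '</a>';
        }
        // c1 is undefined when nothing follows the last word
        return c0 + (c1 || "");
    }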
})();

All I did to my English function above was:

  • Use [\u0600-\u06ff] to be "a word character" and [^\u0600-\u06ff] to be "not a word character". You may need to add some of the other ranges listed on that Wikipedia page (such as the appropriate style of numerals), but again, all of the text in your example fell into that range.
  • Change the keywords to the Arabic ones from your example (only two of which seem to be in the text).

To my very non-Arabic-reading eyes, it seems to work.
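One caveat with the plain U+0600 to U+06FF range is that it also contains Arabic punctuation, such as the Arabic comma (U+060C), so a keyword immediately followed by that comma would be read as part of a longer "word" and missed. If your target browsers support Unicode property escapes (the u flag with \p{...}, added in ES2018), a script-agnostic alternative is to treat letters, combining marks, numbers, and underscore as word characters. This is a sketch under that assumption; highlightKeywords is a hypothetical name and the sample sentence is mine:

```javascript
// Word characters: any Unicode letter, combining mark, number, or "_".
// Requires an engine with ES2018 Unicode property escapes.
var WORD_RUN = /([\p{L}\p{M}\p{N}_]+)([^\p{L}\p{M}\p{N}_]+)?/gu;

function highlightKeywords(text, keywords) {
    return text.replace(WORD_RUN, function (m, word, rest) {
        if (Object.prototype.hasOwnProperty.call(keywords, word)) {
            word = '<a href="#">' + word + '</a>';
        }
        // rest is undefined when the text ends with a word
        return word + (rest || "");
    });
}

var keywords = { "طهران": true, "you": true };

// The Arabic comma after the keyword is punctuation, not a letter,
// so it ends the word and the keyword is still found.
var arabicOut = highlightKeywords("زار طهران، ثم عاد", keywords);

// The same function handles the English keywords.
var englishOut = highlightKeywords("Hello you", keywords);
// englishOut -> 'Hello <a href="#">you</a>'
```

The same pattern covers both the English and Arabic versions of the site, at the cost of dropping support for older browsers that lack \p{...}; for those, the explicit-range approach above remains the fallback.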
