JavaScript Regex to Match Boundaries of Words with Diacritics


Problem description

I have to match word boundaries in a text document for words that contain diacritics. Given a word token, my regex looks like this:

var wordRegex = new RegExp("\\b(" + token + ")\\b", "g");
var match = null;
while ((match = wordRegex.exec(text)) !== null) {
  // skip occurrences that were already assigned to an earlier token
  if (match.index > (seen.get(token) || -1)) {
    var wordStart = match.index;
    var wordEnd = wordStart + token.length - 1;
    item.characterOffsetBegin = wordStart;
    item.characterOffsetEnd = wordEnd;

    seen.set(token, wordEnd);
    break;
  }
}

This works fine for ordinary words like ciao, casa, etc. But it does not work when the text contains words like però, così, etc.

const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male"
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i)
tokens.forEach((token, tokenIndex) => {
  var item = {
    "index": (tokenIndex + 1),
    "word": token
  }
  var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
  var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
  var match = null;
  console.log(token, "---->", wordRegex)
  while ((match = wordRegex.exec(text)) !== null) {
    console.log("\t---->", match.index)
    if (match.index > (seen.get(token) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + token.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;

      seen.set(token, wordEnd);
      break;
    }
  }
})

You can see that some words (like macchine or nascoste) match, so I get the match.index, while for other words (like però) the regex does not work properly and the match variable is null:

macchine ----> /\b(macchine)\b/g
    ----> 7
nascoste ----> /\b(nascoste)\b/g
    ----> 16
e, ----> /\b(e\,)\b/g
però, ----> /\b(però\,)\b/g
nascoste ----> /\b(nascoste)\b/g
    ----> 16
    ----> 34
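
The missing matches follow from how JavaScript defines `\b`: it marks a position between a `\w` character and a non-`\w` character, and `\w` covers only `[A-Za-z0-9_]`. An accented letter such as `ò` therefore counts as a non-word character, so no boundary exists between `ò` and a following comma or space. A minimal check, using a sample string taken from the text above:

```javascript
const sample = "e, però, nascoste";

// "ò" is not a \w character, so the regex engine treats it like punctuation.
const isWordChar = /^\w$/.test("ò");

// No boundary exists between "ò" and ",", so the trailing \b fails.
const accentedMatch = /\bperò\b/.test(sample);

// A plain ASCII word is still delimited as expected.
const asciiMatch = /\bnascoste\b/.test(sample);

console.log(isWordChar, accentedMatch, asciiMatch); // false false true
```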

How can I write a boundary regex that supports diacritics too?

[UPDATE] Following the approach suggested in the comments, I now remove diacritics from each word token before building the regex, and from the whole text as well:

var normalizedText = removeDiacritics(text);
// for each token...
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
escaped = removeDiacritics(escaped);
var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
var match = null;
while ((match = wordRegex.exec(normalizedText)) !== null) {
  //...

This time the accented words are captured by the \b word boundaries. Of course this approach is not optimal, because removeDiacritics must be applied to every token, so the best solution would be to do this only once.
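
The update does not show the body of `removeDiacritics` at this point. One possible sketch, assuming no library is available, decomposes the text with `String.prototype.normalize("NFD")` and strips the combining marks. The helper name `stripCombiningMarks` is hypothetical and this is not the asker's actual implementation. Normalizing the whole text once, outside the token loop, also covers part of the cost concern above.

```javascript
// Hypothetical helper: decomposes each character (NFD) and deletes the
// combining marks (Unicode category M) that follow the base letter.
function stripCombiningMarks(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

const text = "Ci son macchine nascoste e, però, nascoste male";

// Normalize the whole text once, outside any per-token loop.
const normalizedText = stripCombiningMarks(text);

// With the accent gone, \b sees an ASCII letter before the comma.
const match = /\bpero\b/.exec(normalizedText);

console.log(normalizedText); // Ci son macchine nascoste e, pero, nascoste male
console.log(match.index);    // 28
```

For this sample the normalized text has the same length as the original, so the offset can be reused on the original string. That is a property of this input, not a general guarantee.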

Recommended answer

This is the solution we came up with in the comments to map words with diacritics to their index in the text:

function removeDiacritics(text) {
  return _.deburr(text)
}

const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male"
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i)
var normalizedText = removeDiacritics(text)

tokens.forEach((token, tokenIndex) => {
  var item = {
    "index": (tokenIndex + 1),
    "word": removeDiacritics(token)
  }
  var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
  escaped = removeDiacritics(escaped)
  var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
  var match = null;
  console.log(token, "---->", wordRegex)
  while ((match = wordRegex.exec(normalizedText)) !== null) {
    console.log("\t---->", match.index)
    if (match.index > (seen.get(token) || -1)) {
      var wordStart = match.index;
      var wordEnd = wordStart + token.length - 1;
      item.characterOffsetBegin = wordStart;
      item.characterOffsetEnd = wordEnd;

      seen.set(token, wordEnd);
      break;
    }
  }
})

<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.11/lodash.min.js"></script>
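
As an alternative that is not part of the answer above, the boundary can be expressed with Unicode property escapes and lookarounds, so the original text is searched directly and the offsets refer to the unmodified string. This is only a sketch and assumes an engine with lookbehind and `\p{...}` support (ES2018 or later); `escapeRegExp` and `findWordOffsets` are hypothetical names.

```javascript
// Hypothetical helper: escapes regex metacharacters in a literal string.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns [begin, end] offsets (end inclusive) of every
// whole-word occurrence of `word`, treating any Unicode letter, mark, digit
// or underscore as a word character.
function findWordOffsets(text, word) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const pattern = new RegExp(
    `(?<!${wordChar})${escapeRegExp(word)}(?!${wordChar})`,
    "gu"
  );
  const offsets = [];
  for (const match of text.matchAll(pattern)) {
    offsets.push([match.index, match.index + word.length - 1]);
  }
  return offsets;
}

const text = "Ci son macchine nascoste e, però, nascoste male";
console.log(findWordOffsets(text, "però"));     // [[28, 31]]
console.log(findWordOffsets(text, "nascoste")); // [[16, 23], [34, 41]]
console.log(findWordOffsets(text, "mac"));      // []
```

Because the text is never altered, the offsets stay valid even for characters that a diacritics-removal step might expand into more than one letter.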
