JavaScript Regex to Match Boundaries of Words with Diacritics
Question
I need to match word boundaries in a text document for words that contain diacritics. Given a word token, my regex looks like this:
var wordRegex = new RegExp("\\b(" + token + ")\\b", "g");
while ((match = wordRegex.exec(text)) !== null) {
    // accept only a match that starts after the last recorded occurrence of this token
    if (match.index > (seen.get(token) || -1)) {
        var wordStart = match.index;
        var wordEnd = wordStart + token.length - 1;
        item.characterOffsetBegin = wordStart;
        item.characterOffsetEnd = wordEnd;
        seen.set(token, wordEnd);
        break;
    }
}
This works fine for ordinary words like ciao, casa, etc. But it does not work when the text contains words like però, così, etc.
const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male"
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i)
tokens.forEach((token, tokenIndex) => {
    var item = {
        "index": (tokenIndex + 1),
        "word": token
    }
    var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
    var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
    var match = null;
    console.log(token, "---->", wordRegex)
    while ((match = wordRegex.exec(text)) !== null) {
        console.log("\t---->", match.index)
        if (match.index > (seen.get(token) || -1)) {
            var wordStart = match.index;
            var wordEnd = wordStart + token.length - 1;
            item.characterOffsetBegin = wordStart;
            item.characterOffsetEnd = wordEnd;
            seen.set(token, wordEnd);
            break;
        }
    }
})
As you can see, for some words (like macchine or nascoste) the regex matches and I get match.index, while for other words (like però) it does not match and the match variable is null:
macchine ----> /\b(macchine)\b/g
    ----> 7
nascoste ----> /\b(nascoste)\b/g
    ----> 16
e, ----> /\b(e\,)\b/g
però, ----> /\b(però\,)\b/g
nascoste ----> /\b(nascoste)\b/g
    ----> 16
    ----> 34
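The null match is consistent with how JavaScript defines \b: it marks a position between a \w character and a non-\w character, and \w covers only [A-Za-z0-9_]. An accented letter such as ò therefore counts as a non-word character, so no boundary exists between ò and a following comma or space. The short check below illustrates this on the sample sentence; the variable names are chosen only for this illustration.

```javascript
const text = "Ci son macchine nascoste e, però, nascoste male";

// "ò" is not matched by \w, so it is treated like punctuation by \b.
const wordCharCheck = /\w/.test("ò");            // false

// An ASCII-only word has \w characters on both edges, so \b works.
const plainMatch = /\b(nascoste)\b/.exec(text);  // index 16

// After "ò" comes ",": both are non-\w, so the trailing \b cannot match.
const accentMatch = /\b(però)\b/.exec(text);     // null

// The leading \b still matches, because "p" is a \w character.
const leadingOnly = /\b(però)/.exec(text);       // index 28

console.log(wordCharCheck, plainMatch.index, accentMatch, leadingOnly.index);
```

The same reasoning applies to tokens that end in punctuation, such as "e," in the log: a comma followed by a space gives no boundary either.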
How can I write a boundary regex that also supports diacritics?
[UPDATE]
Following the approach suggested in the comments, I now remove diacritics from each word token before building the regex, and also from the whole text, like this:
var normalizedText = removeDiacritics(text);
// for each token...
var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
escaped = removeDiacritics(escaped);
var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
var match = null;
while ((match = wordRegex.exec( normalizedText )) !== null)
{
//...
This time the words with accents are captured by the \b word boundaries. Of course this approach is not optimal, because removeDiacritics must be applied to every token; the best solution would be to do this once.
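The question does not show how removeDiacritics is implemented. As a minimal sketch, assuming the text uses precomposed characters (so that stripping an accent does not change the string length), the helper could be built on String.prototype.normalize and applied once to the whole text. The implementation below is an assumption for illustration, not the poster's actual helper.

```javascript
// Hypothetical implementation: decompose each character (NFD), then drop
// the combining marks in the U+0300..U+036F range.
function removeDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// "\u00F2" is the precomposed "ò", written as an escape to make the form explicit.
const text = "Ci son macchine nascoste e, per\u00F2, nascoste male";
const normalizedText = removeDiacritics(text); // done once for the whole text

// Offsets found in normalizedText are valid for text only if the
// normalization kept the length unchanged.
const sameLength = normalizedText.length === text.length;

// The plain \b boundaries now work, because "o" is a \w character.
const match = /\b(pero)\b/.exec(normalizedText);
const original = text.slice(match.index, match.index + match[1].length);

console.log(sameLength, match.index, original);
```

If the input may contain decomposed characters or letters that expand when folded (for example ß), the lengths can differ and the offsets would need to be remapped.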
Recommended answer
This is the solution we came up with in the comments to map words with diacritics to their index in the text:
function removeDiacritics(text) {
    return _.deburr(text)
}
const seen = new Map();
var text = "Ci son macchine nascoste e, però, nascoste male"
var tokens = text.split(/[^a-zA-Z0-9àèéìíîòóùúÀÈÉÌÍÎÒÓÙÚ]+/i)
var normalizedText = removeDiacritics(text)
tokens.forEach((token, tokenIndex) => {
    var item = {
        "index": (tokenIndex + 1),
        "word": removeDiacritics(token)
    }
    var escaped = token.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
    escaped = removeDiacritics(escaped)
    var wordRegex = new RegExp("\\b(" + escaped + ")\\b", "g");
    var match = null;
    console.log(token, "---->", wordRegex)
    while ((match = wordRegex.exec(normalizedText)) !== null) {
        console.log("\t---->", match.index)
        if (match.index > (seen.get(token) || -1)) {
            var wordStart = match.index;
            var wordEnd = wordStart + token.length - 1;
            item.characterOffsetBegin = wordStart;
            item.characterOffsetEnd = wordEnd;
            seen.set(token, wordEnd);
            break;
        }
    }
})
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.11/lodash.min.js"></script>
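As an alternative to the normalization approach above (and not part of the answer itself), the boundary can be made Unicode-aware by replacing \b with lookarounds built on Unicode property escapes. This is only a sketch: it assumes an engine that supports lookbehind and \p{...} with the u flag (ES2018), and the helper names escapeRegExp and findWord are hypothetical.

```javascript
// Escape only regex syntax characters. With the "u" flag, escaping other
// characters such as "," or "#" is a syntax error.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: return the index of `token` in `text`, starting at
// `fromIndex`, where the token is not adjacent to a letter, mark, digit
// or underscore. Returns -1 when there is no such occurrence.
function findWord(text, token, fromIndex = 0) {
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  const re = new RegExp(
    "(?<!" + wordChar + ")" + escapeRegExp(token) + "(?!" + wordChar + ")",
    "gu"
  );
  re.lastIndex = fromIndex; // with the "g" flag, exec starts searching here
  const match = re.exec(text);
  return match === null ? -1 : match.index;
}

const text = "Ci son macchine nascoste e, per\u00F2, nascoste male";
console.log(findWord(text, "per\u00F2"));    // 28
console.log(findWord(text, "nascoste", 17)); // 34
console.log(findWord(text, "casa"));         // -1
```

Because the search runs on the original text, the returned offsets need no remapping, at the cost of requiring a newer JavaScript engine.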