Programmatic Accent Reduction in JavaScript (aka text normalization or unaccenting)


Question

I need to compare two strings such as these as equal:


Lubeck == Lübeck

In JavaScript.
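
One way to express this comparison in current JavaScript is `Intl.Collator` with base sensitivity, which treats letters that differ only in accent or case as equal. This is a minimal sketch, assuming an environment that supports the ECMAScript Internationalization API and has "en" locale data; the helper name `sameBaseLetters` is made up for illustration, and other locales may rank some accented letters as distinct base letters.

```javascript
// Compare strings while ignoring accents and case.
// Assumes Intl.Collator is available with "en" locale data.
const collator = new Intl.Collator("en", { sensitivity: "base" });

// Hypothetical helper name, not part of any library.
function sameBaseLetters(a, b) {
  return collator.compare(a, b) === 0;
}

// "L\u00fcbeck" is "Lübeck" written with an escape for the precomposed ü.
console.log(sameBaseLetters("Lubeck", "L\u00fcbeck"));
console.log(sameBaseLetters("Lubeck", "Hamburg"));
```

This only answers "are these two whole strings equal?"; it does not say where inside a longer string a match occurs, which is what highlighting needs.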

Why? Well, I have an auto-completion field that goes out to a Java service using Lucene, where place names are stored naturally (as Lübeck) but are also indexed as normalized text:

import sun.text.Normalizer;
oDoc.setNameLC = Normalizer.normalize(oLocName, Normalizer.DECOMP, 0)
    .toLowerCase().replaceAll("[^\\p{ASCII}]","");

This way someone who doesn't know to type "Mèxico" can type "mexico" and get a match which returns "Mèxico" (among a lot of other possible hits, like "Café Mèxico, Dubai, UAE").
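
The server-side normalization has a rough client-side counterpart in `String.prototype.normalize`, which was standardized in ES2015 and so may not have been available when this question was asked. The sketch below decomposes the text (NFD), drops combining marks in the range U+0300 through U+036F, and lowercases. `foldAccents` is an illustrative name, and unlike the Java snippet it does not strip every remaining non-ASCII character.

```javascript
// Strip accents and case from a string.
// Assumes ES2015 String.prototype.normalize is available.
function foldAccents(text) {
  return text
    .normalize("NFD")                 // split "è" into "e" + combining grave
    .replace(/[\u0300-\u036f]/g, "")  // drop the combining marks
    .toLowerCase();
}

// "M\u00e8xico" is "Mèxico" written with an escape for the precomposed è.
console.log(foldAccents("M\u00e8xico") === "mexico");
```

Letters without a canonical decomposition, such as "Æ" or "ø", pass through unchanged under this approach.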

Now the thing is, I can't change the service to do any highlighting on the server side, so I am highlighting on the client side in JavaScript with something like:

// build a RegExp so that "." acts as a wildcard; $& inserts the matched text
return result.replace(new RegExp(input.replace(/[aeiou]/g, "."), "i"), "<b>$&</b>");

It's a little fancier than that, because I am escaping special regex characters in the input. This is fine for simple one-word matches at the beginning of a hit, but it really breaks down if you suddenly want to support multi-word matches like "london cafe":

input = input.strip().toLowerCase(); // fyi Prototype's strip is like trim
re = new RegExp(input.replace(/[aeiou]/g,".").replace(/\s+/g,"|"),"gi");
return result.replace(re, "<b>$&</b>"); // $& inserts the matched text

This doesn't work for, say, "london ca" (typed on the way to "london cafe"), because it would mark "Jack London Cabin, Dawson City, Canada" as: "Ja<b>ck</b> <b>London</b> <b>Ca</b>bin, Dawson <b>Ci</b>ty, <b>Ca</b>nada" [note the "ck" and "Ci" in particular].
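
The over-matching is easy to reproduce. The sketch below builds the pattern for "london ca" the same way (vowels become `.`, whitespace becomes `|`) and wraps each match using `$&`; the function name `vowelWildcardHighlight` is illustrative only.

```javascript
// Reproduces the vowel-wildcard highlighting described above.
function vowelWildcardHighlight(result, input) {
  const source = input
    .trim()
    .toLowerCase()
    .replace(/[aeiou]/g, ".") // every vowel matches any character
    .replace(/\s+/g, "|");    // words become alternatives
  return result.replace(new RegExp(source, "gi"), "<b>$&</b>");
}

console.log(
  vowelWildcardHighlight("Jack London Cabin, Dawson City, Canada", "london ca")
);
// prints: Ja<b>ck</b> <b>London</b> <b>Ca</b>bin, Dawson <b>Ci</b>ty, <b>Ca</b>nada
```

Because "ca" turns into `c.`, any character after a "c" qualifies, so "ck" and "Ci" are marked even though neither is an accented form of "ca".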

Therefore I'm sort of looking for something that's not as crazy as:

input = input.strip().toLowerCase();
// each class also lists the plain letter so unaccented text still matches
input = input.replace(/a/g,"[aÀàÁáÂâÃãÄäÅåÆæĀāĂ㥹]");
input = input.replace(/e/g,"[eÈèÉéÊêËëĒēĔĕĖėĘęĚě]");
// ditto for i, o, u, y, c, n, maybe also d, g, h, j, k, l, r, s, t, w, z
re = new RegExp(input.replace(/\s+/g,"|"),"gi");
return result.replace(re, "<b>$&</b>"); // $& inserts the matched text

Is there a compiled table I can refer to that maps each accented character to its base character (by which I don't mean the plain Unicode chart)? And if so, could I avoid using weird, possibly slow, regex statements?
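
One possible way to avoid per-letter character classes altogether is to fold both the hit and the input to unaccented lowercase, search the folded text with plain `indexOf`, and map the match positions back onto the original characters. The sketch below is only an illustration under assumptions: it relies on ES2015 `String.prototype.normalize`, it assumes the hit text uses precomposed characters, it does not HTML-escape anything, and the names `foldAccents` and `highlightFolded` are invented here.

```javascript
// Strip accents and case (assumes ES2015 String.prototype.normalize).
function foldAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase();
}

// Hypothetical helper: wrap every occurrence of each input word in <b>...</b>,
// matching on folded text but emitting the original characters.
function highlightFolded(result, input) {
  let folded = "";
  const origin = []; // origin[k] = index in `result` that produced folded[k]
  for (let i = 0; i < result.length; i++) {
    const piece = foldAccents(result.charAt(i));
    for (let j = 0; j < piece.length; j++) {
      folded += piece.charAt(j);
      origin.push(i);
    }
  }

  // Mark every original character covered by some word match.
  const marked = new Array(result.length).fill(false);
  const words = foldAccents(input).split(/\s+/).filter(Boolean);
  for (const word of words) {
    let at = folded.indexOf(word);
    while (at !== -1) {
      for (let k = at; k < at + word.length; k++) marked[origin[k]] = true;
      at = folded.indexOf(word, at + 1);
    }
  }

  // Emit the original text, opening and closing <b> at run boundaries.
  let out = "";
  for (let p = 0; p < result.length; p++) {
    if (marked[p] && !marked[p - 1]) out += "<b>";
    out += result.charAt(p);
    if (marked[p] && !marked[p + 1]) out += "</b>";
  }
  return out;
}

console.log(highlightFolded("Jack London Cabin, Dawson City, Canada", "london ca"));
// prints: Jack <b>London</b> <b>Ca</b>bin, Dawson City, <b>Ca</b>nada
```

This matches anywhere inside a word, not only at word starts; restricting matches to word boundaries would be a further design choice.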

About the bounty:
Before I started the bounty there were two answers: one pointing me to doing it in Ruby, and the one MizzardX wrote, which completes the basic form I'd put in my question. Don't get me wrong, I really appreciate him working it out as completely as he did, but I was hoping there might be another way. So far it seems that everyone who has dropped by to look at the question and answer has decided either that MizzardX covers it just fine or that they have no different approach. I would be interested in a different approach, and if one simply isn't available before the bounty closes, MizzardX will win the bounty (though in a cruel twist, his edits made it a community wiki answer, so I'm not sure he'll get the bounty!).

Recommended answer

/**
 * Creates a RegExp that matches the words in the search string.
 * Case and accent insensitive.
 */
function make_pattern(search_string) {
    // escape meta characters
    search_string = search_string.replace(/([|()[{.+*?^$\\])/g,"\\$1");

    // split into words
    var words = search_string.split(/\s+/);

    // sort by length
    var length_comp = function (a,b) {
        return b.length - a.length;
    };
    words.sort(length_comp);

    // replace each character by a character class of its accented variants
    var accent_replacer = function(chr) {
        return accented[chr.toUpperCase()] || chr;
    };
    for (var i = 0; i < words.length; i++) {
        words[i] = words[i].replace(/\S/g,accent_replacer);
    }

    // join as alternatives
    var regexp = words.join("|");
    return new RegExp(regexp,'g');
}

var accented = {
    'A': '[Aa\xaa\xc0-\xc5\xe0-\xe5\u0100-\u0105\u01cd\u01ce\u0200-\u0203\u0226\u0227\u1d2c\u1d43\u1e00\u1e01\u1e9a\u1ea0-\u1ea3\u2090\u2100\u2101\u213b\u249c\u24b6\u24d0\u3371-\u3374\u3380-\u3384\u3388\u3389\u33a9-\u33af\u33c2\u33ca\u33df\u33ff\uff21\uff41]',
    'B': '[Bb\u1d2e\u1d47\u1e02-\u1e07\u212c\u249d\u24b7\u24d1\u3374\u3385-\u3387\u33c3\u33c8\u33d4\u33dd\uff22\uff42]',
    'C': '[Cc\xc7\xe7\u0106-\u010d\u1d9c\u2100\u2102\u2103\u2105\u2106\u212d\u216d\u217d\u249e\u24b8\u24d2\u3376\u3388\u3389\u339d\u33a0\u33a4\u33c4-\u33c7\uff23\uff43]',
    'D': '[Dd\u010e\u010f\u01c4-\u01c6\u01f1-\u01f3\u1d30\u1d48\u1e0a-\u1e13\u2145\u2146\u216e\u217e\u249f\u24b9\u24d3\u32cf\u3372\u3377-\u3379\u3397\u33ad-\u33af\u33c5\u33c8\uff24\uff44]',
    'E': '[Ee\xc8-\xcb\xe8-\xeb\u0112-\u011b\u0204-\u0207\u0228\u0229\u1d31\u1d49\u1e18-\u1e1b\u1eb8-\u1ebd\u2091\u2121\u212f\u2130\u2147\u24a0\u24ba\u24d4\u3250\u32cd\u32ce\uff25\uff45]',
    'F': '[Ff\u1da0\u1e1e\u1e1f\u2109\u2131\u213b\u24a1\u24bb\u24d5\u338a-\u338c\u3399\ufb00-\ufb04\uff26\uff46]',
    'G': '[Gg\u011c-\u0123\u01e6\u01e7\u01f4\u01f5\u1d33\u1d4d\u1e20\u1e21\u210a\u24a2\u24bc\u24d6\u32cc\u32cd\u3387\u338d-\u338f\u3393\u33ac\u33c6\u33c9\u33d2\u33ff\uff27\uff47]',
    'H': '[Hh\u0124\u0125\u021e\u021f\u02b0\u1d34\u1e22-\u1e2b\u1e96\u210b-\u210e\u24a3\u24bd\u24d7\u32cc\u3371\u3390-\u3394\u33ca\u33cb\u33d7\uff28\uff48]',
    'I': '[Ii\xcc-\xcf\xec-\xef\u0128-\u0130\u0132\u0133\u01cf\u01d0\u0208-\u020b\u1d35\u1d62\u1e2c\u1e2d\u1ec8-\u1ecb\u2071\u2110\u2111\u2139\u2148\u2160-\u2163\u2165-\u2168\u216a\u216b\u2170-\u2173\u2175-\u2178\u217a\u217b\u24a4\u24be\u24d8\u337a\u33cc\u33d5\ufb01\ufb03\uff29\uff49]',
    'J': '[Jj\u0132-\u0135\u01c7-\u01cc\u01f0\u02b2\u1d36\u2149\u24a5\u24bf\u24d9\u2c7c\uff2a\uff4a]',
    'K': '[Kk\u0136\u0137\u01e8\u01e9\u1d37\u1d4f\u1e30-\u1e35\u212a\u24a6\u24c0\u24da\u3384\u3385\u3389\u338f\u3391\u3398\u339e\u33a2\u33a6\u33aa\u33b8\u33be\u33c0\u33c6\u33cd-\u33cf\uff2b\uff4b]',
    'L': '[Ll\u0139-\u0140\u01c7-\u01c9\u02e1\u1d38\u1e36\u1e37\u1e3a-\u1e3d\u2112\u2113\u2121\u216c\u217c\u24a7\u24c1\u24db\u32cf\u3388\u3389\u33d0-\u33d3\u33d5\u33d6\u33ff\ufb02\ufb04\uff2c\uff4c]',
    'M': '[Mm\u1d39\u1d50\u1e3e-\u1e43\u2120\u2122\u2133\u216f\u217f\u24a8\u24c2\u24dc\u3377-\u3379\u3383\u3386\u338e\u3392\u3396\u3399-\u33a8\u33ab\u33b3\u33b7\u33b9\u33bd\u33bf\u33c1\u33c2\u33ce\u33d0\u33d4-\u33d6\u33d8\u33d9\u33de\u33df\uff2d\uff4d]',
    'N': '[Nn\xd1\xf1\u0143-\u0149\u01ca-\u01cc\u01f8\u01f9\u1d3a\u1e44-\u1e4b\u207f\u2115\u2116\u24a9\u24c3\u24dd\u3381\u338b\u339a\u33b1\u33b5\u33bb\u33cc\u33d1\uff2e\uff4e]',
    'O': '[Oo\xba\xd2-\xd6\xf2-\xf6\u014c-\u0151\u01a0\u01a1\u01d1\u01d2\u01ea\u01eb\u020c-\u020f\u022e\u022f\u1d3c\u1d52\u1ecc-\u1ecf\u2092\u2105\u2116\u2134\u24aa\u24c4\u24de\u3375\u33c7\u33d2\u33d6\uff2f\uff4f]',
    'P': '[Pp\u1d3e\u1d56\u1e54-\u1e57\u2119\u24ab\u24c5\u24df\u3250\u3371\u3376\u3380\u338a\u33a9-\u33ac\u33b0\u33b4\u33ba\u33cb\u33d7-\u33da\uff30\uff50]',
    'Q': '[Qq\u211a\u24ac\u24c6\u24e0\u33c3\uff31\uff51]',
    'R': '[Rr\u0154-\u0159\u0210-\u0213\u02b3\u1d3f\u1d63\u1e58-\u1e5b\u1e5e\u1e5f\u20a8\u211b-\u211d\u24ad\u24c7\u24e1\u32cd\u3374\u33ad-\u33af\u33da\u33db\uff32\uff52]',
    'S': '[Ss\u015a-\u0161\u017f\u0218\u0219\u02e2\u1e60-\u1e63\u20a8\u2101\u2120\u24ae\u24c8\u24e2\u33a7\u33a8\u33ae-\u33b3\u33db\u33dc\ufb06\uff33\uff53]',
    'T': '[Tt\u0162-\u0165\u021a\u021b\u1d40\u1d57\u1e6a-\u1e71\u1e97\u2121\u2122\u24af\u24c9\u24e3\u3250\u32cf\u3394\u33cf\ufb05\ufb06\uff34\uff54]',
    'U': '[Uu\xd9-\xdc\xf9-\xfc\u0168-\u0173\u01af\u01b0\u01d3\u01d4\u0214-\u0217\u1d41\u1d58\u1d64\u1e72-\u1e77\u1ee4-\u1ee7\u2106\u24b0\u24ca\u24e4\u3373\u337a\uff35\uff55]',
    'V': '[Vv\u1d5b\u1d65\u1e7c-\u1e7f\u2163-\u2167\u2173-\u2177\u24b1\u24cb\u24e5\u2c7d\u32ce\u3375\u33b4-\u33b9\u33dc\u33de\uff36\uff56]',
    'W': '[Ww\u0174\u0175\u02b7\u1d42\u1e80-\u1e89\u1e98\u24b2\u24cc\u24e6\u33ba-\u33bf\u33dd\uff37\uff57]',
    'X': '[Xx\u02e3\u1e8a-\u1e8d\u2093\u213b\u2168-\u216b\u2178-\u217b\u24b3\u24cd\u24e7\u33d3\uff38\uff58]',
    'Y': '[Yy\xdd\xfd\xff\u0176-\u0178\u0232\u0233\u02b8\u1e8e\u1e8f\u1e99\u1ef2-\u1ef9\u24b4\u24ce\u24e8\u33c9\uff39\uff59]',
    'Z': '[Zz\u0179-\u017e\u01f1-\u01f3\u1dbb\u1e90-\u1e95\u2124\u2128\u24b5\u24cf\u24e9\u3390-\u3394\uff3a\uff5a]'
};
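
A smaller table of the same shape can also be generated and not typed by hand. The sketch below is a simplification under assumptions, not the method used for the table above: it relies on ES2015 `String.prototype.normalize`, scans only U+00C0 through U+024F (an arbitrary range chosen for illustration), and keeps a character when its canonical decomposition starts with an ASCII letter. It therefore omits the compatibility characters the table above includes, such as circled and fullwidth letters. The name `buildAccentClasses` is invented here.

```javascript
// Build { "A": "[Aa...]", ..., "Z": "[Zz...]" } from canonical decompositions.
// Scans only U+00C0..U+024F, an assumed range chosen for illustration.
function buildAccentClasses() {
  const members = {};
  for (let code = 0x00c0; code <= 0x024f; code++) {
    const ch = String.fromCharCode(code);
    const base = ch.normalize("NFD").charAt(0); // first code unit after decomposition
    if (/^[A-Za-z]$/.test(base)) {
      const key = base.toUpperCase();
      members[key] = (members[key] || "") + ch;
    }
  }
  const classes = {};
  for (let code = 65; code <= 90; code++) { // "A".."Z"
    const upper = String.fromCharCode(code);
    classes[upper] = "[" + upper + upper.toLowerCase() + (members[upper] || "") + "]";
  }
  return classes;
}

const classes = buildAccentClasses();
// "L\u00fcbeck" is "Lübeck" written with an escape for the precomposed ü.
console.log(new RegExp("^L" + classes.U + "beck$").test("L\u00fcbeck"));
```

The returned object has the same key and value shape as `accented`, so in principle it could stand in for it, at the cost of narrower coverage.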
