Accent-insensitive substring matching

Problem description

I have a search functionality that obtains data from an InnoDB table (utf8_spanish_ci collation) and displays it in an HTML document (UTF-8 charset). The user types a substring and obtains a list of matches where the first substring occurrence is highlighted, e.g.:

Matches for "AL":

Álava
<strong>Al</strong>bacete
<strong>Al</strong>mería
Ciudad Re<strong>al</strong>
Málaga

As you can see from the example, the search ignores both case and accent differences (MySQL takes care of that automatically). However, the code I'm using to highlight matches handles only case, not accents:

<?php

private static function highlightTerm($full_string, $match){
    $start = mb_stripos($full_string, $match);
    $length = mb_strlen($match);

    return
        htmlspecialchars( mb_substr($full_string, 0, $start)) .
        '<strong>' . htmlspecialchars( mb_substr($full_string, $start, $length) ) . '</strong>' .
        htmlspecialchars( mb_substr($full_string, $start+$length) );
}

?>

Is there a sensible way to fix this that doesn't imply hard-coding all possible variations?

Update: System specs are PHP/5.2.14 and MySQL/5.1.48

Recommended answer

You could use the Normalizer class (part of the intl extension) to normalize the string to Normalization Form KD (NFKD), in which characters are decomposed: Á (U+00C1) becomes the letter A (U+0041) followed by the combining acute accent (U+0301):

$str = Normalizer::normalize($str, Normalizer::FORM_KD);

Then you modify the search pattern so that each letter may be followed by an optional combining mark:

$pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
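
To make the pattern construction concrete, here is a minimal sketch of what it produces for the term "AL" and how it matches a string that is already decomposed. The helper name buildAccentPattern is hypothetical (it is not part of the original answer), and the test string is written with explicit byte escapes so the example does not depend on Normalizer being available.

```php
<?php
// Hypothetical helper, not from the original answer: wraps the pattern
// construction shown above so its result can be inspected.
function buildAccentPattern($term)
{
    // Escape regex metacharacters first, then allow one optional combining
    // mark (Unicode category Mn) after every letter of the term.
    $body = preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/'));
    return '/(' . $body . ')/ui';
}

$pattern = buildAccentPattern('AL');
// $pattern is now the string: /(A\p{Mn}?L\p{Mn}?)/ui

// "Álava" in decomposed form: "A", then U+0301 (UTF-8 bytes CC 81), then "lava".
$decomposed = "A\xCC\x81lava";

// The fourth argument (1) limits the replacement to the first occurrence.
echo preg_replace($pattern, '<strong>$0</strong>', $decomposed, 1), "\n";
```

The output wraps the decomposed "Ál" in the strong tags and leaves "ava" untouched. Note that the pattern only tolerates marks the subject string has and the term lacks; a term that itself contains accents would still need its own marks stripped.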

Then do the replacement with preg_replace:

preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str))

So the complete method is:

private static function highlightTerm($str, $term) {
    $str = Normalizer::normalize($str, Normalizer::FORM_KD);
    $pattern = '/('.preg_replace('/\p{L}/u', '$0\p{Mn}?', preg_quote($term, '/')).')/ui';
    return preg_replace($pattern, '<strong>$0</strong>', htmlspecialchars($str));
}
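
The method above leaves a few edge cases open: the search term is not normalized, every occurrence is highlighted rather than only the first, and the pattern runs against text that has already been HTML-escaped. The following is a hedged sketch of one way to address these, not the original answer's code. The function names highlightFirstMatch and recomposeAndEscape are hypothetical, and the sketch assumes the intl extension is loaded (Normalizer ships with PHP 5.3 and later, or with PECL intl on older versions such as the 5.2.14 mentioned in the question).

```php
<?php
// Hypothetical helpers, not from the original answer.
// Assumes the intl extension is loaded so that the Normalizer class exists.

// Recompose a decomposed fragment and escape it for HTML output.
function recomposeAndEscape($fragment)
{
    if ($fragment === '') {
        return '';
    }
    $fragment = Normalizer::normalize($fragment, Normalizer::FORM_C);
    return htmlspecialchars($fragment, ENT_QUOTES, 'UTF-8');
}

function highlightFirstMatch($str, $term)
{
    // Decompose both inputs so accents become separate combining marks.
    $str  = Normalizer::normalize($str, Normalizer::FORM_KD);
    $term = Normalizer::normalize($term, Normalizer::FORM_KD);

    // Drop the term's own marks so that "ál" and "al" build the same pattern.
    $term = preg_replace('/\p{Mn}+/u', '', $term);
    if ($term === '') {
        return recomposeAndEscape($str);
    }

    // Every letter of the term may be followed by any number of marks.
    $body    = preg_replace('/\p{L}/u', '$0\p{Mn}*', preg_quote($term, '/'));
    $pattern = '/' . $body . '/ui';

    // Locate the first match only; the reported offset is in bytes.
    if (!preg_match($pattern, $str, $m, PREG_OFFSET_CAPTURE)) {
        return recomposeAndEscape($str);
    }
    list($matched, $offset) = $m[0];

    // Split on byte offsets, then escape each piece separately so HTML
    // entities can never interfere with the match.
    $before = (string) substr($str, 0, $offset);
    $after  = (string) substr($str, $offset + strlen($matched));

    return recomposeAndEscape($before)
        . '<strong>' . recomposeAndEscape($matched) . '</strong>'
        . recomposeAndEscape($after);
}
```

Called as highlightFirstMatch('Málaga', 'AL'), this sketch returns M<strong>ál</strong>aga, with the accented letter recomposed into a single character. Because it keeps the answer's NFKD form, compatibility characters such as ligatures are also folded; using Normalizer::FORM_D instead would avoid that if it matters for the displayed text.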
