Transliterate any convertible UTF-8 char into its ASCII equivalent
Question
Is there a good solution out there that handles this transliteration well?
I've tried using iconv(), but it is very annoying and does not behave as one might expect.
- Using //TRANSLIT will try to replace what it can, leaving everything nonconvertible as "?".
- Using //IGNORE will not leave "?" in the text, but it will also not transliterate, and it raises an E_NOTICE when a nonconvertible char is found, so you have to use iconv with the @ error suppressor.
- Using //IGNORE//TRANSLIT (as some people suggested in the PHP forum) is actually the same as //IGNORE (I tried it myself on PHP versions 5.3.2 and 5.3.13).
- Using //TRANSLIT//IGNORE is likewise the same as //TRANSLIT.
It also uses the current locale settings to transliterate.
Warning: lots of text and code follow!
Here are some examples:
$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo '<br />original: ' . $text;
echo '<br />regular: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> regular: Regular ascii text + ????? + ???ss + ?????? + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'en_GB');
echo '<br />en_GB: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'en_GB.UTF8'); // will this work?
echo '<br />en_GB.UTF8: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB.UTF8: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
OK, that converted č ć ž š ä ö ü ß é ĕ ě ė ë ȩ and æ, but why not đ and ø?
// now specific locales
setlocale(LC_ALL, 'hr_Hr'); // this should fix croatian đ, right?
echo '<br />hr_Hr: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// wrong > hr_Hr: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
setlocale(LC_ALL, 'sv_SE'); // so this will fix swedish ø?
echo '<br />sv_SE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// will not > sv_SE: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
//this is interesting
setlocale(LC_ALL, 'de_DE');
echo '<br />de_DE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> de_DE: Regular ascii text + cczs? + aeoeuess + eeeeee + ae?EUR + $ + ? + @
// actually this is what any german would expect since ä ö ü really is same as ae oe ue
Let's try using //IGNORE:
echo '<br />ignore: ' . iconv("UTF-8", "ASCII//IGNORE", $text);
//> ignore: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 49"
// with translit?
echo '<br />ignore/translit: ' . iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text);
//same as ignore only> ignore/translit: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 54"
// translit/ignore?
echo '<br />translit/ignore: ' . iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $text);
//same as translit only> translit/ignore: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @
Using this guy's solution also does not work as wanted: Regular ascii text + YYYYY + aous + eYYYeY + aoY + $ + � + @
Even using the PECL intl Normalizer class (which is not always available even if you have PHP > 5.3.0, since the ICU package that intl uses may not be available to PHP, e.g. on certain hosting servers) produces the wrong result:
echo '<br />normalize: ' .preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD));
//>normalize: Regular ascii text + cczsđ + aouß + eeeeee + æø€ + $ + ¶ + @
So is there any other way of doing this right, or is the only proper thing to do to use preg_replace() or str_replace() and define the transliteration tables yourself?
// Appendix: I found a debate on the ZF wiki from 2008 about a proposal for Zend_Filter_Transliterate, but the project was dropped because in some languages conversion is not possible (e.g. Chinese). Still, IMO this option should exist for any Latin- and Cyrillic-based language.
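For reference, the "define the transliteration tables yourself" option raised in the question could look roughly like the sketch below. The function name translitWithTable() and the table contents are assumptions made for illustration (they come from neither PHP nor any library), and the table only covers the lowercase characters from the test string above.

```php
<?php
// Hypothetical helper, not part of PHP or any library: a hand-maintained
// transliteration table applied with strtr().
function translitWithTable($text)
{
    // Illustrative subset only; extend it for the alphabets you need.
    $table = array(
        'č' => 'c', 'ć' => 'c', 'ž' => 'z', 'š' => 's', 'đ' => 'd',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
        'é' => 'e', 'ĕ' => 'e', 'ě' => 'e', 'ė' => 'e', 'ë' => 'e', 'ȩ' => 'e',
        'æ' => 'ae', 'ø' => 'o', '€' => 'EUR',
    );

    // strtr() with an array tries the longest keys first and never
    // rescans text it has already replaced.
    $text = strtr($text, $table);

    // Anything still outside 7-bit ASCII becomes "?", mirroring //TRANSLIT.
    return preg_replace('/[^\x00-\x7F]/u', '?', $text);
}

echo translitWithTable('čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @');
// prints: cczsd + aouss + eeeeee + aeoEUR + $ + ? + @
```

Unlike iconv(), the result here does not depend on the current locale, at the cost of having to maintain the table by hand.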
Recommended answer
The toAscii() function of Patchwork\Utf8 does exactly this, see:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php
It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.
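To illustrate the general idea of combining a Normalizer pass with a small table, here is a minimal sketch. It is not the Patchwork\Utf8 implementation: the function name toAsciiSketch() and the table are assumptions, and the Normalizer step is skipped when the intl extension is not loaded.

```php
<?php
// Hypothetical sketch; NOT the actual Patchwork\Utf8::toAscii() code.
function toAsciiSketch($text)
{
    // Step 1: characters that Normalizer leaves untouched in the question's
    // output (đ, ø, æ, ß, €) need an explicit table. Illustrative subset.
    $noDecomposition = array(
        'đ' => 'd', 'Đ' => 'D', 'ø' => 'o', 'Ø' => 'O',
        'æ' => 'ae', 'Æ' => 'AE', 'ß' => 'ss', '€' => 'EUR',
    );
    $text = strtr($text, $noDecomposition);

    // Step 2: if intl is loaded, decompose accented letters (NFKD) and
    // drop the combining marks, as in the question's Normalizer attempt.
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
        if ($decomposed !== false) {
            $text = preg_replace('/\p{Mn}/u', '', $decomposed);
        }
    }

    // Step 3: replace whatever is still non-ASCII with "?".
    return preg_replace('/[^\x00-\x7F]/u', '?', $text);
}

echo toAsciiSketch('čćžšđ + äöüß + æø€');
```

With intl loaded this should print "cczsd + aouss + aeoEUR"; without it, the accented letters that rely on step 2 fall through to "?" while the table entries are still converted.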