Transliterate any convertible UTF-8 char into its ASCII equivalent


Question

Is there any good solution out there that does this transliteration well?

I've tried using iconv(), but it is very annoying and does not behave as one might expect.

  • Using //TRANSLIT will try to replace what it can, leaving everything nonconvertible as "?".
  • Using //IGNORE will not leave "?" in the text, but it will also not transliterate, and it raises an E_NOTICE when a nonconvertible char is found, so you have to use iconv with the @ error suppressor.
  • Using //IGNORE//TRANSLIT (as some people suggested in the PHP forum) is actually the same as //IGNORE (I tried it myself on PHP versions 5.3.2 and 5.3.13).
  • Using //TRANSLIT//IGNORE is likewise the same as //TRANSLIT.
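As an alternative to the @ suppressor mentioned in the list above, the notices can be captured with a temporary error handler so the caller can inspect them. This is only a minimal sketch: the function name `iconvCollect` and the shape of the returned array are hypothetical, and what iconv actually reports depends on the iconv implementation PHP was built against.

```php
<?php
// Hypothetical helper (not from the original post): run iconv() and collect
// any notices/warnings it raises instead of silencing them with @.
function iconvCollect(string $text, string $to = 'ASCII//TRANSLIT'): array
{
    $messages = [];

    // Temporarily route PHP errors raised during the call into $messages.
    set_error_handler(function (int $errno, string $errstr) use (&$messages): bool {
        $messages[] = $errstr;
        return true; // mark as handled so PHP's default handler stays quiet
    });

    try {
        $converted = iconv('UTF-8', $to, $text); // string on success, false on failure
    } finally {
        restore_error_handler(); // always put the previous handler back
    }

    return ['text' => $converted, 'messages' => $messages];
}

$result = iconvCollect('Regular ascii text', 'ASCII');
echo $result['text'], "\n";
```

The caller can then decide whether a non-empty `messages` array should be logged, ignored, or treated as a failure.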

It also uses the current locale settings to transliterate.

Warning: a lot of text and code follows!

Here are some examples:

$text = 'Regular ascii text + čćžšđ + äöüß + éĕěėëȩ + æø€ + $ + ¶ + @';
echo '<br />original: ' . $text;
echo '<br />regular: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> regular: Regular ascii text + ????? + ???ss + ?????? + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'en_GB');
echo '<br />en_GB: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'en_GB.UTF8'); // will this work?
echo '<br />en_GB.UTF8: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> en_GB.UTF8: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

OK, that did convert č, ć, ž, š, ä, ö, ü, ß, é, ĕ, ě, ė, ë, ȩ and æ, but why not đ and ø?

// now specific locales
setlocale(LC_ALL, 'hr_Hr'); // this should fix croatian đ, right?
echo '<br />hr_Hr: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// wrong > hr_Hr: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

setlocale(LC_ALL, 'sv_SE'); // so this will fix swedish ø?
echo '<br />sv_SE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
// will not > sv_SE: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

//this is interesting
setlocale(LC_ALL, 'de_DE');
echo '<br />de_DE: ' . iconv("UTF-8", "ASCII//TRANSLIT", $text);
//> de_DE: Regular ascii text + cczs? + aeoeuess + eeeeee + ae?EUR + $ + ? + @
// actually this is what any german would expect since ä ö ü really is same as ae oe ue
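The hr_Hr and sv_SE runs above print exactly what en_GB printed. One thing worth ruling out (the post does not confirm it) is that the locale was never applied: setlocale() returns false when none of the requested locales is installed, and the previous locale then stays in effect. The sketch below checks that return value. The helper name `translitWithLocale` is hypothetical, and using LC_CTYPE instead of LC_ALL assumes a glibc-style iconv that reads its transliteration rules from that category.

```php
<?php
// Hypothetical helper: try a list of locale names, report which one (if any)
// was applied, transliterate, then restore the previous locale.
function translitWithLocale(string $text, array $candidates): array
{
    $previous = setlocale(LC_CTYPE, '0');          // '0' only queries the current setting
    $applied  = setlocale(LC_CTYPE, $candidates);  // first installed candidate, or false

    // Assumption: iconv picks up the LC_CTYPE locale for //TRANSLIT (glibc behaviour).
    $converted = iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);            // leave global state as we found it
    }

    return ['locale' => $applied, 'text' => $converted];
}

$result = translitWithLocale('čćžšđ', ['hr_HR.UTF-8', 'hr_HR.utf8', 'hr_HR']);
var_dump($result['locale']); // false means no Croatian locale is installed here
```

If `locale` comes back as false, the output says nothing about how that locale would transliterate.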

Let's try using //IGNORE:

echo '<br />ignore: ' . iconv("UTF-8", "ASCII//IGNORE", $text);
//> ignore: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 49"

// with translit?
echo '<br />ignore/translit: ' . iconv("UTF-8", "ASCII//IGNORE//TRANSLIT", $text);
//same as ignore only> ignore/translit: Regular ascii text + + + + + $ + + @
//+ E_NOTICE: "Notice: iconv(): Detected an illegal character in input string in /var/www/test.server.web/index.php on line 54"

// translit/ignore?
echo '<br />translit/ignore: ' . iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $text);
//same as translit only> translit/ignore: Regular ascii text + cczs? + aouss + eeeeee + ae?EUR + $ + ? + @

Using this guy's solution also does not work as wanted: Regular ascii text + YYYYY + aous + eYYYeY + aoY + $ + � + @

Even using the PECL intl Normalizer class (which is not always available even if you have PHP > 5.3.0, since the ICU package that intl uses may not be available to PHP, e.g. on certain hosting servers) produces the wrong result:

echo '<br />normalize: ' .preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD));
//>normalize: Regular ascii text + cczsđ + aouß + eeeeee + æø€ + $ + ¶ + @

So is there any other way of doing this right, or is the only proper thing to do to use preg_replace() or str_replace() and define the transliteration tables yourself?
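For reference, the "define the tables yourself" route could look roughly like the sketch below. It uses strtr() with an array rather than str_replace(), because strtr() applies all pairs in a single pass. The function name `asciiFallback` and the table contents are illustrative assumptions, not a complete or authoritative mapping.

```php
<?php
// Hypothetical hand-written table approach. The table is deliberately tiny;
// a real one would need many more entries.
function asciiFallback(string $text, array $extra = []): ?string
{
    // Entries in $extra take precedence over the defaults (array union keeps left keys).
    $table = $extra + [
        'č' => 'c', 'ć' => 'c', 'ž' => 'z', 'š' => 's',
        'đ' => 'd', 'Đ' => 'D',
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
        'æ' => 'ae', 'Æ' => 'AE',
        'ø' => 'o', 'Ø' => 'O',
        '€' => 'EUR',
    ];

    // strtr() with an array replaces every key in one pass, longest match first.
    $mapped = strtr($text, $table);

    // Anything still outside ASCII becomes "?". With the /u flag each code point
    // is replaced once; preg_replace() returns null if the input is not valid UTF-8.
    return preg_replace('/[^\x00-\x7F]/u', '?', $mapped);
}

echo asciiFallback('čćžšđ + äöüß + æø€ + ¶'), "\n";
// prints: cczsd + aouss + aeoEUR + ?
```

Passing `['ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue']` as `$extra` would give the German-style result seen in the de_DE run.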

// Appendix: I found a 2008 debate on the ZF wiki about a proposal for Zend_Filter_Transliterate, but the project was dropped since in some languages the conversion is not possible (e.g. Chinese). Still, IMO this option should exist for any Latin- and Cyrillic-based language.

Recommended answer

The toAscii() function of Patchwork\Utf8 does exactly this, see:

https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php

It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.
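If pulling in the library is not an option, a different technique that may be worth trying is the Transliterator API of the intl extension (PHP 5.4+). To be clear, this is not how Patchwork\Utf8 is implemented; it is a separate sketch, and the function name `toAsciiSketch` is made up for illustration. It assumes the installed ICU knows the "Latin-ASCII" transform, and it falls back to plain "?" replacement when intl is missing.

```php
<?php
// Hypothetical sketch using intl's transliterator_transliterate(), when available.
function toAsciiSketch(string $text): string
{
    if (function_exists('transliterator_transliterate')) {
        // "Any-Latin" romanises non-Latin scripts where ICU has rules for them,
        // "Latin-ASCII" then folds Latin letters to ASCII.
        $out = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
        if ($out !== false) {
            $text = $out;
        }
    }

    // Replace whatever is still non-ASCII; null (invalid UTF-8) becomes ''.
    $clean = preg_replace('/[^\x00-\x7F]/u', '?', $text);
    return $clean ?? '';
}

echo toAsciiSketch('Regular ascii text + čćžšđ + äöüß + æø'), "\n";
```

The exact replacements depend on the ICU version, so it is worth running this against the sample string from the question before relying on it.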
