Converting Unicode characters into the equivalent ASCII ones

Question

I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representation in ASCII, so it's OK to discard them completely. So what I expect from

echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");

is Gothe, but instead it outputs Gothe?EUR.

In addition to letters, I'd also like the whole variety of Unicode numerals and punctuation marks, such as periods, commas, dashes and slashes, to be replaced by their closest ASCII counterparts. ASCII//TRANSLIT//IGNORE in the iconv function does this already, but it also produces garbage output for the Unicode characters for which it cannot find an ASCII replacement. I'd like such characters to be ignored entirely.

How do I get the expected result? Is there a better way, perhaps using the intl library?

Recommended answer

You've picked a hard problem. It is better to ask the users who enter Unicode characters to transliterate them into ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.

Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic

No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever Unicode characters they want.

But life is jarring and offensive, so off we go:

This PHP code:

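// Maps accented Latin-1 letters to ASCII look-alikes, one byte per character.
// Caveats: utf8_decode() targets ISO-8859-1, which lacks Š, Œ, Ž, š, œ, ž and Ÿ,
// so those decode to '?' and '?' itself ends up in the translation list.
// utf8_decode() is also deprecated as of PHP 8.2.
function toASCII( $str )
{
    return strtr(utf8_decode($str),
        utf8_decode(
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}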

What the above PHP function does is convert both the input and the list of accented characters from UTF-8 to single-byte ISO-8859-1 with utf8_decode, and then use strtr to replace each character from that list (the second argument of strtr) with the character at the same position in the third argument.
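The same pairing idea can be sketched without utf8_decode, because strtr also accepts a single array of replacement pairs and then works directly on the UTF-8 bytes. This is only an illustration: the function name toAsciiWithMap is made up, the map is a small subset, and it assumes the source file is saved as UTF-8 and the input uses precomposed characters.

```php
<?php
// Hypothetical helper: transliterate with an explicit UTF-8 replacement map.
function toAsciiWithMap(string $str): string
{
    // Illustrative subset only; extend it with every character you trust.
    $map = [
        'À' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Æ' => 'AE',
        'Ö' => 'O', 'Ü' => 'U', 'ß' => 'ss',
        'à' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae',
        'ö' => 'o', 'ü' => 'u', 'ÿ' => 'y',
    ];

    // With an array argument, strtr replaces whole substrings, so
    // multi-byte keys and multi-character values are both fine.
    return strtr($str, $map);
}

echo toAsciiWithMap('Göthe'), "\n"; // prints "Gothe"
```

Characters that are not in the map, such as Ф or €, pass through unchanged here, so something still has to decide what happens to them.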

For example, the Unicode À is transliterated to the ASCII A, and å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
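The "remove them" option could look roughly like this: apply whatever replacement map you trust, then drop every code point that is still outside ASCII. The function name flattenToAscii and the one-entry map in the usage line are placeholders for illustration, and returning an empty string for invalid UTF-8 is just one possible choice.

```php
<?php
// Hypothetical helper: map what you can, then discard what is left.
function flattenToAscii(string $str, array $map): string
{
    // Step 1: replace the characters you have an ASCII equivalent for.
    $mapped = strtr($str, $map);

    // Step 2: remove every code point above U+007F.
    // The /u modifier makes PCRE treat the subject as UTF-8.
    $ascii = preg_replace('/[^\x00-\x7F]/u', '', $mapped);

    // preg_replace returns null if the input is not valid UTF-8.
    return $ascii ?? '';
}

echo flattenToAscii('GötheФ€', ['ö' => 'o']), "\n"; // prints "Gothe"
```

With this split, the map carries all the judgment calls and the regular expression only enforces that nothing non-ASCII survives.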

There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?

This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and block out the other barbarians, keeping them out at the moat. It's the only reliable way.
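A whitelist filter along those lines might be sketched as follows. The function name keepWhitelisted and the particular set of allowed characters are assumptions chosen for the example, not a recommendation.

```php
<?php
// Hypothetical helper: keep only explicitly allowed characters.
function keepWhitelisted(string $str): string
{
    // Whitelist: ASCII letters, digits, space, period, comma, slash, hyphen.
    // Every other byte, including all bytes of non-ASCII characters, is dropped.
    $clean = preg_replace('~[^A-Za-z0-9 .,/-]~', '', $str);

    return $clean ?? '';
}

echo keepWhitelisted('Göthe, 1/2!'), "\n"; // prints "Gthe, 1/2"
```

Note that the ö simply disappears here, which is why the whitelist works best as the last step, after any transliteration you have chosen to apply.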
