How to transliterate non-Latin scripts?


Question

I'm experimenting with transliteration in PHP using iconv. In particular, I want to normalise accented characters and romanise other scripts from UTF-8 to plain ASCII.

While many characters work (such as Ž -> Z), others give odd results or raise errors.

For example, E ACUTE é (U+00E9) transliterates to ASCII with a single quote (U+0027) preceding the e, as if it were trying to represent the diacritic mark I'm trying to get rid of.

$utf_8 = "\xC3\xA9"; // <- é
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// returns "'e", not "e"

Non-Latin scripts are worse. For example, Greek capital sigma Σ (U+03A3), which should transliterate to Latin S, is not recognised at all and raises a notice:

$utf_8 = "\xCE\xA3"; // <- Σ
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// Raises notice: iconv(): Detected an illegal character in input string

I can just about cope with the first one, but how can I transliterate "Σ" to "S", and do this reliably across other scripts that have equivalent characters?

I don't mind generating my own tables if there is a good source that works for most European languages.

Note that I've tried various collation tables, which are useful for normalising accented Latin characters, but they don't work for transliterating between scripts.

Recommended answer

I've attempted something similar. It's mainly based on Doctrine 1 code and isn't perfect, but it seemed to work with all the test data I threw at it.
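The code this answer refers to is not shown here, so what follows is only a minimal sketch of the table-driven approach the question mentions ("generating my own tables"); it is not the Doctrine 1-based code itself. The function name `transliterate_sketch` and the tiny mapping table are hypothetical and deliberately incomplete; a real table would need many more entries per script.

```php
<?php
// Hypothetical helper for illustration; NOT the Doctrine 1-based code from the answer.
function transliterate_sketch(string $text): string
{
    // Deliberately tiny, illustrative table (an assumption):
    // UTF-8 byte sequence => ASCII replacement.
    static $map = [
        "\xC3\xA9" => 'e', // é U+00E9
        "\xC3\x89" => 'E', // É U+00C9
        "\xC5\xBD" => 'Z', // Ž U+017D
        "\xC5\xBE" => 'z', // ž U+017E
        "\xCE\xA3" => 'S', // Σ U+03A3
        "\xCF\x83" => 's', // σ U+03C3
        "\xCF\x82" => 's', // ς U+03C2 (final sigma)
    ];

    // With an array argument, strtr() replaces every key it finds
    // and does not re-scan text it has already replaced.
    $text = strtr($text, $map);

    // Anything still outside ASCII has no table entry; mark it with '?'.
    // preg_replace() returns null on invalid UTF-8, hence the fallback.
    $ascii = preg_replace('/[^\x00-\x7F]/u', '?', $text);

    return $ascii ?? '';
}

echo transliterate_sketch("\xCE\xA3\xC3\xA9"), "\n"; // prints "Se"
```

Because the table is explicit, é maps to a bare "e" with no stray quote, and Σ no longer triggers the iconv notice. The trade-off is that every character has to be listed by hand. If PHP's intl extension is available, its Transliterator class (for example with the ICU rule "Any-Latin; Latin-ASCII") may be worth trying before maintaining a table yourself.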
