How can I change extended latin characters to their unaccented ASCII equivalents?
Problem description
I need a generic transliteration or substitution regex that will map extended latin characters to similar-looking ASCII characters, and all other extended characters to '' (empty string), so that...
é becomes e
ê becomes e
á becomes a
ç becomes c
Ď becomes D
and so on, but things like ‡ or Ω or ‰ just get stripped away.
Recommended answer
All brilliant answers, but none of them really worked for me. Putting extended characters directly in the source code caused problems when working in terminal windows or various code/text editors across platforms. I tried out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I wanted.
In the end I just enumerated all the characters I wanted transliterated myself for UTF-8 (which is the most frequent code page found in my input data).
I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters each.
For interested parties, the final code is (the tr is a single line):
$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
$word =~ s/[^\x00-\x7F]+//g;
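One thing worth checking before reusing this, sketched here under my own assumptions rather than taken from the answer: the \xC0-\xFF escapes match single characters with those code points, so input that arrives as raw UTF-8 bytes presumably has to be decoded first (in UTF-8, é is the two bytes \xC3\xA9). The sub name strip_accents and the shortened tr table are illustrative only; decode comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original answer):
# takes raw UTF-8 bytes and returns plain ASCII.
sub strip_accents {
    my ($bytes) = @_;
    # Turn UTF-8 bytes into characters so that \xE9 means "é", not a stray byte.
    my $word = decode('UTF-8', $bytes);
    # Abbreviated table for illustration; the full tr above covers \xC0-\xFF.
    $word =~ tr/\xE0\xE1\xE2\xE7\xE8\xE9\xEA\xEB/aaaceeee/;
    # Ligatures expand to two letters, so they need s/// rather than tr.
    $word =~ s/\xC6/AE/g;
    $word =~ s/\xE6/ae/g;
    # Drop whatever is still outside ASCII.
    $word =~ s/[^\x00-\x7F]+//g;
    return $word;
}

# "café æon ‰" given as UTF-8 bytes
print strip_accents("caf\xC3\xA9 \xC3\xA6on \xE2\x80\xB0"), "\n";   # prints "cafe aeon "
```

Without the decode step, the tr would see \xC3 and \xA9 as two separate characters, so é would not come out as a plain e.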
Since things like Ď lie outside the \xC0-\xFF range that the tr above covers (strictly speaking, that range is Latin-1 rather than UTF-8), they don't occur nearly so often in my input data. For non-UTF-8 input, I chose to just lose everything above 127.
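For readers who do need characters such as Ď handled, a common alternative is to decompose the string and strip the combining marks. This is only a sketch, not the approach I ended up using: the sub name to_ascii is made up for illustration, the input is assumed to be an already decoded character string, and NFD comes from the core Unicode::Normalize module. The sample text is written with \x{...} escapes so no extended characters appear in the source.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper; expects a decoded character string, not raw bytes.
sub to_ascii {
    my ($word) = @_;
    # The ligatures have no decomposition, so map them by hand first.
    $word =~ s/\x{C6}/AE/g;
    $word =~ s/\x{E6}/ae/g;
    # Split each accented letter into base letter + combining mark(s).
    $word = NFD($word);
    # Remove the combining marks (Unicode general category Mn).
    $word =~ s/\p{Mn}+//g;
    # Strip anything that is still non-ASCII, e.g. the double dagger or Omega.
    $word =~ s/[^\x00-\x7F]+//g;
    return $word;
}

# "Ďéjà vu ‡Ω‰" written with escapes
print to_ascii("\x{10E}\x{E9}j\x{E0} vu \x{2021}\x{3A9}\x{2030}"), "\n";   # prints "Deja vu "
```

The trade-off is that letters without a decomposition, such as ø or ð, are removed by the last substitution instead of being mapped to o, so they would still need explicit entries like the ones in the tr.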