How can I change extended Latin characters to their unaccented ASCII equivalents?

Question

I need a generic transliteration or substitution regex that will map extended latin characters to similar looking ASCII characters, and all other extended characters to '' (empty string) so that...

  • é becomes e

  • ê becomes e

  • á becomes a

  • ç becomes c

  • Ď becomes D

and so on, but things like ‡ or Ω or ‰ just get stripped away.

Recommended answer

All brilliant answers, but none of them actually worked for me. Putting extended characters directly in the source code caused problems when working in terminal windows or in various code/text editors across platforms. I was able to try out Unicode::Normalize, Text::Unidecode and Text::Unaccent, but wasn't able to get any of them to do exactly what I want.

In the end I just enumerated all the characters I wanted transliterated myself: the accented letters in the \xC0-\xFF range (Latin-1, which covers the extended characters most frequently found in my input data).

I needed two extra substitutions to take care of æ and Æ, which I want mapped to two characters each.

For interested parties, the final code is below (the tr is a single line):

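# Map accented letters to their base ASCII letters.
# The tr must stay on one line: a newline inside the search list
# would itself be treated as a character to translate.
$word =~ tr/\xC0\xC1\xC2\xC3\xC4\xC5\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xDD\xE0\xE1\xE2\xE3\xE4\xE5\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF8\xF9\xFA\xFB\xFC\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
# Ligatures expand to two characters, so they need s/// rather than tr.
$word =~ s/\xC6/AE/g;
$word =~ s/\xE6/ae/g;
# Drop everything that is still outside 7-bit ASCII.
$word =~ s/[^\x00-\x7F]+//g;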
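
One thing to watch for: the \xC0-\xFF escapes match code points, so the code above only behaves as intended on a decoded character string (or on raw Latin-1 bytes). If $word still holds raw UTF-8 bytes, é arrives as the two bytes \xC3\xA9 and would come out as "A". The following is a minimal sketch, assuming the input is UTF-8 encoded bytes; to_ascii is a hypothetical helper name, and the tr is written with ranges, which should be equivalent to the enumerated list above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of the original answer): decode UTF-8
# bytes into characters, then apply the same substitutions as above.
sub to_ascii {
    my ($bytes) = @_;
    my $word = decode('UTF-8', $bytes);    # bytes -> character string

    # Same mapping as the enumerated tr, expressed with ranges.
    $word =~ tr/\xC0-\xC5\xC7-\xD6\xD8-\xDD\xE0-\xE5\xE7-\xF6\xF8-\xFD\xFF/AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiionoooooouuuuyy/;
    $word =~ s/\xC6/AE/g;                  # Æ
    $word =~ s/\xE6/ae/g;                  # æ
    $word =~ s/[^\x00-\x7F]+//g;           # strip remaining non-ASCII
    return $word;
}

# Input written as byte escapes, so no extended characters in the source.
print to_ascii("caf\xC3\xA9 \xC3\x86on\xE2\x80\xA1"), "\n";   # prints "cafe AEon"
```

Writing the test input as byte escapes keeps the source file pure ASCII, which avoids the editor and terminal problems mentioned earlier.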

Characters like Ď (U+010E) fall outside the \xC0-\xFF range covered by the tr, and they don't occur nearly so often in my input data. For those, and for any other input the tr does not recognize, I chose to just lose everything above 127.
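
If characters such as Ď do need to map to D, as the question asked, one alternative to extending the tr list by hand is to decompose each character and drop the combining marks. The sketch below uses NFD from the core Unicode::Normalize module (one of the modules mentioned above). It is only an illustration under stated assumptions: strip_marks is a hypothetical name, the input is assumed to be an already decoded character string, and the handling of Ø, ø, Ð and ð is this sketch's own choice (the tr above maps ð to o instead).

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: expects a decoded character string.
sub strip_marks {
    my ($word) = @_;

    # These letters have no canonical decomposition, so handle them first.
    $word =~ s/\x{C6}/AE/g;                        # Æ
    $word =~ s/\x{E6}/ae/g;                        # æ
    $word =~ tr/\x{D8}\x{F8}\x{D0}\x{F0}/OoDd/;    # Ø ø Ð ð (sketch's choice)

    $word = NFD($word);              # e.g. U+010E becomes "D" + U+030C
    $word =~ s/\p{Mn}+//g;           # remove the combining marks
    $word =~ s/[^\x00-\x7F]+//g;     # strip whatever is still non-ASCII
    return $word;
}

# Ď é ê á ç followed by ‡ Ω ‰, all written as escapes.
print strip_marks("\x{10E}\x{E9}\x{EA}\x{E1}\x{E7}\x{2021}\x{3A9}\x{2030}"), "\n";   # prints "Deeac"
```

Compared with the enumerated tr, this covers accented letters beyond \xFF without listing them, at the cost of depending on Unicode decomposition data rather than an explicit table.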
