latin-1 to ascii


Question

I have a unicode string with accented latin chars, e.g.

n = unicode('Wikipédia, le projet d’encyclopédie', 'utf-8')

I want to convert it to plain ascii, i.e. 'Wikipedia, le projet dencyclopedie', so all acutes, accents, cedillas etc. should get removed.

What is the fastest way to do that? It needs to be done for matching against a long autocomplete dropdown list.

Conclusion: As one of my criteria is speed, Lennart's "register your own error handler for unicode encoding/decoding" gives the best result (see Alex's answer below); the speed difference increases further as more and more of the characters are latin.

Here is the translation table I am using, with a modified error handler, since it needs to take care of the whole range of un-encoded characters from error.start to error.end:

# -*- coding: utf-8 -*-
import codecs

"""
This is more of visual translation also avoiding multiple char translation
e.g. £ may be written as {pound}
"""
latin_dict = {
u"¡": u"!", u"¢": u"c", u"£": u"L", u"¤": u"o", u"¥": u"Y",
u"¦": u"|", u"§": u"S", u"¨": u"`", u"©": u"c", u"ª": u"a",
u"«": u"<<", u"¬": u"-", u"­": u"-", u"®": u"R", u"¯": u"-",
u"°": u"o", u"±": u"+-", u"²": u"2", u"³": u"3", u"´": u"'",
u"µ": u"u", u"¶": u"P", u"·": u".", u"¸": u",", u"¹": u"1",
u"º": u"o", u"»": u">>", u"¼": u"1/4", u"½": u"1/2", u"¾": u"3/4",
u"¿": u"?", u"À": u"A", u"Á": u"A", u"Â": u"A", u"Ã": u"A",
u"Ä": u"A", u"Å": u"A", u"Æ": u"Ae", u"Ç": u"C", u"È": u"E",
u"É": u"E", u"Ê": u"E", u"Ë": u"E", u"Ì": u"I", u"Í": u"I",
u"Î": u"I", u"Ï": u"I", u"Ð": u"D", u"Ñ": u"N", u"Ò": u"O",
u"Ó": u"O", u"Ô": u"O", u"Õ": u"O", u"Ö": u"O", u"×": u"*",
u"Ø": u"O", u"Ù": u"U", u"Ú": u"U", u"Û": u"U", u"Ü": u"U",
u"Ý": u"Y", u"Þ": u"p", u"ß": u"b", u"à": u"a", u"á": u"a",
u"â": u"a", u"ã": u"a", u"ä": u"a", u"å": u"a", u"æ": u"ae",
u"ç": u"c", u"è": u"e", u"é": u"e", u"ê": u"e", u"ë": u"e",
u"ì": u"i", u"í": u"i", u"î": u"i", u"ï": u"i", u"ð": u"d",
u"ñ": u"n", u"ò": u"o", u"ó": u"o", u"ô": u"o", u"õ": u"o",
u"ö": u"o", u"÷": u"/", u"ø": u"o", u"ù": u"u", u"ú": u"u",
u"û": u"u", u"ü": u"u", u"ý": u"y", u"þ": u"p", u"ÿ": u"y", 
u"’":u"'"}

def latin2ascii(error):
    """
    error covers the portion of text from error.start to error.end;
    we convert just the first char, hence return error.start + 1
    instead of error.end
    """
    return latin_dict[error.object[error.start]], error.start + 1

codecs.register_error('latin2ascii', latin2ascii)

if __name__ == "__main__":
    x = u"¼ éíñ§ÐÌëÑ » ¼ ö ® © ’"
    print x
    print x.encode('ascii', 'latin2ascii')

Why I return error.start + 1:

The error object returned can cover multiple characters, and we convert only the first of these; e.g. if I add print error.start, error.end to the error handler, the output is:

¼ éíñ§ÐÌëÑ » ¼ ö ® © ’
0 1
2 10
3 10
4 10
5 10
6 10
7 10
8 10
9 10
11 12
13 14
15 16
17 18
19 20
21 22
1/4 einSDIeN >> 1/4 o R c '

So in the second line we get the chars from 2-10, but we convert only the 2nd, hence return 3 as the continuation point. If we return error.end instead, the output is:

¼ éíñ§ÐÌëÑ » ¼ ö ® © ’
0 1
2 10
11 12
13 14
15 16
17 18
19 20
21 22
1/4 e >> 1/4 o R c '

As we can see, the 2-10 portion has been replaced by a single char. Of course it would be faster to just encode the whole range in one go and return error.end, but for demonstration purposes I have kept it simple.
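For reference, here is a minimal sketch of that faster variant, assuming the same latin_dict and codecs import as above (latin2ascii_range is a hypothetical name; unlike the single-char version, which raises KeyError on an unmapped character, this sketch silently drops unmapped characters):

def latin2ascii_range(error):
    # hypothetical variant: translate the whole failing range in one
    # go, then resume at error.end instead of error.start + 1
    chars = error.object[error.start:error.end]
    return u''.join(latin_dict.get(c, u'') for c in chars), error.end

codecs.register_error('latin2ascii_range', latin2ascii_range)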

See http://docs.python.org/library/codecs.html#codecs.register_error for more details.

Answer

So here are three approaches, more or less as given or suggested in other answers:

# -*- coding: utf-8 -*-
import codecs
import unicodedata

x = u"Wikipédia, le projet d’encyclopédie"

xtd = {ord(u'’'): u"'", ord(u'é'): u'e', }

def asciify(error):
    return xtd[ord(error.object[error.start])], error.end

codecs.register_error('asciify', asciify)

def ae():
  # ascii-encode via the custom 'asciify' error handler
  return x.encode('ascii', 'asciify')

def ud():
  # NFKD-decompose, then drop whatever still won't encode to ASCII
  return unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore')

def tr():
  # translate every character through the xtd map
  return x.translate(xtd)

if __name__ == '__main__':
  print 'or:', x
  print 'ae:', ae()
  print 'ud:', ud()
  print 'tr:', tr()

Run as main, this emits:

or: Wikipédia, le projet d’encyclopédie
ae: Wikipedia, le projet d'encyclopedie
ud: Wikipedia, le projet dencyclopedie
tr: Wikipedia, le projet d'encyclopedie

This shows clearly that the unicodedata-based approach, while it does have the convenience of not needing a translation map xtd, can't translate all characters properly in an automated fashion (it works for accented letters but not for the reverse-apostrophe), so it would also need some auxiliary step to deal explicitly with those (no doubt before what's now its body).
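The underlying reason can be seen by inspecting the NFKD decompositions directly (a quick illustrative check; expected output is shown in the comments):

import unicodedata

# NFKD splits the accented letter into base char + combining mark,
# so encoding to ASCII with 'ignore' keeps the plain 'e':
print repr(unicodedata.normalize('NFKD', u'é'))   # u'e\u0301'
# U+2019 (the curly apostrophe) has no decomposition, so 'ignore'
# deletes it outright instead of mapping it to "'":
print repr(unicodedata.normalize('NFKD', u'’'))   # u'\u2019'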

Performance is also interesting. On my laptop with Mac OS X 10.5 and system Python 2.5, quite repeatably:

$ python -mtimeit -s'import a' 'a.ae()'
100000 loops, best of 3: 7.5 usec per loop
$ python -mtimeit -s'import a' 'a.ud()'
100000 loops, best of 3: 3.66 usec per loop
$ python -mtimeit -s'import a' 'a.tr()'
10000 loops, best of 3: 21.4 usec per loop

translate is surprisingly slow (relative to the other approaches). I believe the issue is that the dict is consulted for every character in the translate case (and most are not there), but only for the few characters that ARE there with the asciify approach.
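One way to see the asymmetry (an illustrative sketch reusing x and xtd from above): the asciify handler only fires when the ascii encode actually hits a non-ascii character, whereas translate consults the map once per character:

# the handler is invoked only on encode failures; a pure-ASCII
# string never touches xtd at all:
u"plain ascii text".encode('ascii', 'asciify')
# translate, by contrast, looks up every character of x in xtd,
# although only three of them are actually in the map:
x.translate(xtd)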

So for completeness, here's the "beefed-up unicodedata" approach:

specstd = {ord(u'’'): u"'", }
def specials(error):
  return specstd.get(ord(error.object[error.start]), u''), error.end
codecs.register_error('specials', specials)

def bu():
  return unicodedata.normalize('NFKD', x).encode('ASCII', 'specials')

this gives the right output, BUT:

$ python -mtimeit -s'import a' 'a.bu()'
100000 loops, best of 3: 10.7 usec per loop

...speed isn't all that good any more. So, if speed matters, it's no doubt worth the trouble of making a complete xtd translation dict and using the asciify approach. When a few extra microseconds per translation are no big deal, one might want to consider the bu approach simply for its convenience (it only needs a translation dict for the, hopefully few, special characters that don't translate correctly with the underlying unicodedata idea).
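If one does build such a complete dict, the question's latin_dict can be reshaped into the ordinal-keyed form that both asciify and translate expect (a one-line sketch; xtd_full is a hypothetical name):

# keys must be code points (ints), not 1-char unicode strings
xtd_full = dict((ord(k), v) for k, v in latin_dict.items())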
