What's a good way to replace international characters with their base Latin counterparts using Python?


Problem description

Say I have the string "blöt träbåt", which contains a few a's and o's with an umlaut or a ring above. I want it to become "blot trabat" as simply as possible. I've done some digging and found the following method:

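import unicodedata
# Python 2: unicode() makes sure the input is a unicode object;
# NFKD then decomposes each accented character into base letter + combining mark.
unicode_string = unicodedata.normalize('NFKD', unicode(string))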

This will give me the string in unicode format with the international characters split into a base letter and a combining character (\u0308 for umlauts). Now, to get this back to an ASCII string, I could do ascii_string = unicode_string.encode('ASCII', 'ignore'), and it'll just ignore the combining characters, resulting in the string "blot trabat".
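For readers on Python 3, where every str is already Unicode and the unicode() call is no longer needed, the same two steps might look like the sketch below. The helper name strip_diacritics is made up for illustration and is not part of the standard library.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks (and any other non-ASCII characters) from text.

    Hypothetical helper name, used here only for illustration.
    """
    # NFKD splits e.g. "ö" into "o" followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" discards everything that is not ASCII,
    # which includes the combining marks; decode to get a str back.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_diacritics("blöt träbåt"))  # blot trabat
```

Note that characters without a decomposition are dropped entirely: "ß", for example, simply disappears, so "Straße" comes out as "Strae".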

The question here is: is there a better way to do this? It feels like a roundabout way, and I was thinking there might be something I don't know about. I could of course wrap it up in a helper function, but I'd rather check if this doesn't exist in Python already.

Solution

It would be better if you created an explicit table, and then used the unicode.translate method. The advantage would be that transliteration is more precise, e.g. transliterating "ö" to "oe" and "ß" to "ss", as should be done in German.
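As a rough sketch of the explicit-table idea, assuming Python 3 (where the unicode type is str, so the method is str.translate): the table below is deliberately tiny and covers only the German examples mentioned above, and the names GERMAN_TABLE and transliterate_german are hypothetical.

```python
import unicodedata

# Illustrative table only; a real one would cover many more characters.
GERMAN_TABLE = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text):
    """Replace German umlauts and sharp s using the explicit table above."""
    # Normalize to NFC first so that "o" + combining diaeresis is turned into
    # the single precomposed "ö" that the table keys expect.
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(GERMAN_TABLE)


print(transliterate_german("Größe"))  # Groesse
```

Characters that are not in the table pass through unchanged, so a table like this can be combined with another step (or a larger table) to handle other languages.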

There are several transliteration packages on PyPI: translitcodec, Unidecode, and trans.
