How to normalize unicode encoding for iso-8859-15 conversion in python?


Problem description


I want to convert unicode strings to iso-8859-15. These strings include the u"\u2019" character (RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm), which is not part of the iso-8859-15 character set.

In Python, how can I normalize the unicode characters so that they match the iso-8859-15 encoding?

I have looked at the unicodedata module without success. I managed to do the job with

s.replace(u"\u2019", "'").encode('iso-8859-15')

but I would like to find a more general and cleaner way.

Thanks for your help

Solution

Use the unicode version of the translate function, assuming s is a unicode string:

s.translate({ord(u"\u2019"):ord(u"'")})

The argument of the unicode version of translate is a dict mapping unicode ordinals to unicode ordinals. Add to this dict other characters you cannot encode in your target encoding.
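As a minimal sketch of how this fits with the encoding step, the snippet below first shows that encoding the string directly fails, then translates and encodes it. The sample string and the names `table`, `encoded` and `direct_encoding_failed` are assumptions made for illustration; they are not part of the original question.

```python
# Assumed sample text; any unicode string containing U+2019 behaves the same way.
s = u"It\u2019s 5 \u20ac"

# Encoding directly fails because U+2019 has no iso-8859-15 equivalent.
try:
    s.encode("iso-8859-15")
    direct_encoding_failed = False
except UnicodeEncodeError:
    direct_encoding_failed = True

# Map RIGHT SINGLE QUOTATION MARK to a plain apostrophe, then encode.
table = {ord(u"\u2019"): ord(u"'")}
encoded = s.translate(table).encode("iso-8859-15")
```

The euro sign is left alone by the table because iso-8859-15 can already represent it; only characters missing from the target character set need an entry.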

You can build your mapping table in a little more readable form and create your mapping dict from it, for instance:

char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]
translate_mapping = {ord(k):ord(v) for k,v in char_mappings}
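One way to use such a mapping is to wrap the two steps in a small helper. This is only a sketch: the function name `to_iso_8859_15` and the sample input are hypothetical, and the mapping list would need to be extended with whatever other characters appear in your data.

```python
# Readable list of (source character, replacement character) pairs.
char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]

# translate() expects ordinals as keys, so convert the pairs once up front.
translate_mapping = {ord(k): ord(v) for k, v in char_mappings}


def to_iso_8859_15(text):
    """Hypothetical helper: replace unsupported characters, then encode."""
    return text.translate(translate_mapping).encode("iso-8859-15")


result = to_iso_8859_15(u"caf\u00e9 `quoted\u2019")
```

Characters that iso-8859-15 already supports, such as the accented "é" in the sample, pass through the translation unchanged and are encoded normally.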


From the translate documentation:

For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
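The quoted passage also allows table values that are unicode strings or None, which the examples above do not exercise. The sketch below illustrates both cases; the particular characters chosen (horizontal ellipsis, zero width space) and the sample text are assumptions picked for illustration only.

```python
# Table values may be ordinals, unicode strings, or None.
table = {
    0x2019: u"'",    # RIGHT SINGLE QUOTATION MARK -> apostrophe
    0x2026: u"...",  # HORIZONTAL ELLIPSIS -> three separate dots
    0x200B: None,    # ZERO WIDTH SPACE -> deleted
}

# Assumed sample text containing all three characters.
s = u"Wait\u2026 it\u2019s\u200b here"

cleaned = s.translate(table)
encoded = cleaned.encode("iso-8859-15")
```

Mapping to a string is useful when a single character has no one-character equivalent in the target encoding, and mapping to None suits invisible characters that can simply be dropped.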
