How to normalize Unicode encoding for ISO-8859-15 conversion in Python?
Problem description
I want to convert Unicode strings to ISO-8859-15. These strings include the u"\u2019" character (RIGHT SINGLE QUOTATION MARK, see http://www.fileformat.info/info/unicode/char/2019/index.htm), which is not part of the ISO-8859-15 character set.
In Python, how can I normalize the Unicode characters so that they can be encoded as ISO-8859-15?
I have looked at the unicodedata module without success. I managed to do the job with
s.replace(u"\u2019", "'").encode('iso-8859-15')
but I would like to find a more general and cleaner way.
Thanks for your help
Use the unicode version of the translate function, assuming s is a unicode string:
s.translate({ord(u"\u2019"):ord(u"'")})
The argument of the unicode version of translate is a dict mapping Unicode ordinals to Unicode ordinals. Add to this dict any other characters that cannot be encoded in your target encoding.
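As a minimal sketch of the translate-then-encode step described above (the sample string is an assumption made up for illustration, not taken from the question):

```python
# Sample text (an assumption for illustration) containing U+2019.
s = u"It\u2019s a test"

# Map the ordinal of U+2019 to the ordinal of the ASCII apostrophe.
table = {ord(u"\u2019"): ord(u"'")}

# Translate first, then encode. Without the translate step,
# s.encode("iso-8859-15") raises UnicodeEncodeError.
encoded = s.translate(table).encode("iso-8859-15")
```

Characters that are absent from the table pass through translate unchanged, so only the listed substitutions are applied before encoding.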
You can write the mappings in a slightly more readable form and build the mapping dict from them, for instance:
char_mappings = [(u"\u2019", u"'"),
                 (u"`", u"'")]
translate_mapping = {ord(k): ord(v) for k, v in char_mappings}
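The mapping dict built this way can be wrapped in a small helper. The sketch below is only one possible arrangement: the helper name to_latin9 and the extra quotation-mark entries are assumptions added for illustration, not part of the original answer.

```python
# Pairs of (character to replace, replacement). The first entry comes from
# the answer; the other three are hypothetical additions for illustration.
char_mappings = [
    (u"\u2019", u"'"),   # RIGHT SINGLE QUOTATION MARK
    (u"\u2018", u"'"),   # LEFT SINGLE QUOTATION MARK
    (u"\u201c", u'"'),   # LEFT DOUBLE QUOTATION MARK
    (u"\u201d", u'"'),   # RIGHT DOUBLE QUOTATION MARK
]
translate_mapping = {ord(k): ord(v) for k, v in char_mappings}


def to_latin9(text):
    """Hypothetical helper: substitute mapped characters, then encode."""
    return text.translate(translate_mapping).encode("iso-8859-15")


# The euro sign needs no mapping: ISO-8859-15 encodes it as byte 0xA4.
result = to_latin9(u"\u201cIt\u2019s 5\u20ac\u201d")
```

Any character that is neither mapped nor present in ISO-8859-15 still raises UnicodeEncodeError, which makes missing table entries easy to spot.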
From the translate documentation:
For Unicode objects, the translate() method does not accept the optional deletechars argument. Instead, it returns a copy of the s where all characters have been mapped through the given translation table which must be a mapping of Unicode ordinals to Unicode ordinals, Unicode strings or None. Unmapped characters are left untouched. Characters mapped to None are deleted. Note, a more flexible approach is to create a custom character mapping codec using the codecs module (see encodings.cp1251 for an example).
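The quoted passage also allows table values that are Unicode strings or None. A short sketch of both cases follows; the sample text and the choice of U+2026 and U+200B are assumptions for illustration only.

```python
# Keys are Unicode ordinals; values may be an ordinal, a string, or None.
table = {
    0x2019: ord(u"'"),  # ordinal -> ordinal: curly apostrophe to ASCII apostrophe
    0x2026: u"...",     # ordinal -> string: HORIZONTAL ELLIPSIS to three dots
    0x200B: None,       # ordinal -> None: ZERO WIDTH SPACE is deleted
}

text = u"Wait\u2026 it\u2019s\u200b done"
cleaned = text.translate(table)

# Every remaining character exists in ISO-8859-15, so encoding succeeds.
encoded = cleaned.encode("iso-8859-15")
```

Mapping to a string is useful when one character has no single-character equivalent in the target encoding, and mapping to None drops characters that carry no visible content.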