Detect/remove unpaired surrogate character in Python 2 + GTK

Problem description

In Python 2.7 I can successfully convert the Unicode string "abc\udc34xyz" to UTF-8 (result is "abc\xed\xb0\xb4xyz"). But when I pass the UTF-8 string to e.g. pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the "unpaired surrogate" in the string and (correctly?) reject it.

Python 3 doesn't even allow conversion of the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF8 string with the lone surrogate replaced by some other character. That's fine for me, but I need a solution for Python 2.
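As a small illustration of the Python 3 behaviour described here, the following sketch (Python 3 only; the variable names are just for illustration) shows the strict encoder rejecting the lone surrogate and the "replace" error handler substituting a question mark:

```python
# Python 3 only: the strict UTF-8 encoder rejects a lone surrogate.
s = "abc\udc34xyz"

try:
    s.encode("utf8")
    strict_ok = True
except UnicodeEncodeError:
    # "surrogates not allowed"
    strict_ok = False

# With the "replace" handler the lone surrogate is encoded as b"?".
replaced = s.encode("utf8", "replace")
print(strict_ok, replaced)
```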

So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing the lone surrogate with some replacement character like U+FFFD? Preferably only standard Python functions and GTK/GLib/G... functions should be used.

By the way, iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.

Solution

You can do the replacements yourself before encoding:

import re

lone = re.compile(
    ur'''(?x)            # verbose expression (allows comments)
    (                    # begin group
    [\ud800-\udbff]      #   match leading surrogate
    (?![\udc00-\udfff])  #   but only if not followed by trailing surrogate
    )                    # end group
    |                    #  OR
    (                    # begin group
    (?<![\ud800-\udbff]) #   if not preceded by leading surrogate
    [\udc00-\udfff]      #   match trailing surrogate
    )                    # end group
    ''')

u = u'abc\ud834\ud82a\udfcdxyz'
print repr(u)
b = lone.sub(ur'\ufffd',u).encode('utf8')
print repr(b)
print repr(b.decode('utf8'))

Output:

u'abc\ud834\U0001abcdxyz'
'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
u'abc\ufffd\U0001abcdxyz'
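
If you only want to detect the lone surrogates, or drop them rather than substitute U+FFFD, the same pattern can be reused. The following is a minimal sketch, not part of the original answer: the helper names find_lone_surrogates and remove_lone_surrogates are hypothetical, and the pattern is written with plain u'' literals instead of ur'' so the file also parses on Python 3.

```python
import re

# Same pattern as above, written without the Python 2-only ur'' prefix.
LONE_SURROGATE = re.compile(
    u'[\ud800-\udbff](?![\udc00-\udfff])'    # leading surrogate with no trailing one after it
    u'|(?<![\ud800-\udbff])[\udc00-\udfff]'  # trailing surrogate with no leading one before it
)

def find_lone_surrogates(text):
    """Return the indices of unpaired surrogates in a unicode string."""
    return [m.start() for m in LONE_SURROGATE.finditer(text)]

def remove_lone_surrogates(text):
    """Drop unpaired surrogates instead of replacing them with U+FFFD."""
    return LONE_SURROGATE.sub(u'', text)

sample = u'abc\udc34xyz'
print(find_lone_surrogates(sample))
print(repr(remove_lone_surrogates(sample)))
```

For the sample string the lone surrogate is reported at index 3, and removing it leaves "abcxyz", which encodes to valid UTF-8.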
