UTF-16 to ASCII ignoring characters with decimal value greater than 127

Question

I know there are quite a few solutions to this problem, but my case is peculiar: I may receive truncated UTF-16 data and still have to make a best effort at converting it, even where decode and encode fail with UnicodeDecodeError. So I came up with the following Python code. Please let me know how I can improve it for faster processing.

    def utf16_file_to_ascii(filename):
        # read the raw bytes once; the with-block closes the file on every path
        with open(filename, 'rb') as f:
            raw = f.read()
        try:
            # conversion to ascii if utf16 data is formatted correctly
            return raw.decode('UTF16').encode('ASCII', 'ignore')
        except UnicodeDecodeError:
            pass
        # decoding failed, so use brute force to decode truncated data
        data = bytearray(raw)  # indexing yields ints on Python 2 and 3
        if data[:2] == b'\xff\xfe':
            print("Little-Endian format, UTF-16")
            low, high = data[2::2], data[3::2]
        elif data[:2] == b'\xfe\xff':
            print("Big-Endian format, UTF-16")
            low, high = data[3::2], data[2::2]
        else:
            # no byte order mark, so the byte order is unknown
            return None
        # keep a code unit only if its high byte is 0 and its low byte is in
        # the accepted range; zip() drops a dangling final byte
        kept = [lo for lo, hi in zip(low, high) if hi == 0 and 0 < lo < 127]
        return bytes(bytearray(kept))

Recommended answer

To tolerate errors, you can pass the optional second argument to the byte string's decode method. In this example the dangling third byte ('c') is replaced with the "replacement character" U+FFFD:

    >>> 'abc'.decode('UTF-16', 'replace')
    u'\u6261\ufffd'

There is also an 'ignore' option, which simply drops bytes that can't be decoded:

    >>> 'abc'.decode('UTF-16', 'ignore')
    u'\u6261'
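For the asker's case, the same 'ignore' handler can be used on both the decode and the encode step. The following is only a minimal sketch, assuming Python 3 (where the byte string is a bytes object); the helper name utf16_to_ascii and the sample data are hypothetical and not part of the original post.

```python
def utf16_to_ascii(raw: bytes) -> str:
    """Best-effort conversion of possibly truncated UTF-16 bytes to ASCII text.

    Hypothetical helper for illustration; assumes Python 3.
    """
    # 'ignore' on decode drops bytes that do not form valid UTF-16,
    # such as a dangling final byte in truncated input.
    text = raw.decode("utf-16", "ignore")
    # 'ignore' on encode drops every character that has no ASCII encoding,
    # i.e. every code point above 127.
    return text.encode("ascii", "ignore").decode("ascii")


# A little-endian BOM, the text "héllo", then one dangling byte.
sample = b"\xff\xfe" + "héllo".encode("utf-16-le") + b"w"
print(utf16_to_ascii(sample))  # prints: hllo
```

Note that the 'utf-16' codec relies on the byte order mark to pick the byte order and falls back to the platform's native order when there is none, so input without a BOM may need 'utf-16-le' or 'utf-16-be' spelled out instead.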

While it is common to want a system that is "tolerant" of incorrectly encoded text, it is often quite difficult to define precisely what the expected behavior is in these situations. You may find that whoever set the requirement to "deal with" incorrectly encoded text does not fully grasp the concept of character encoding.
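One way to make the expected behavior concrete is to run the same malformed input through each error handler and agree on which outcome is wanted. This is a small sketch, assuming Python 3; the function name compare_error_handlers is hypothetical, and the byte order is fixed to 'utf-16-le' so that the result does not depend on the platform.

```python
def compare_error_handlers(raw, encoding="utf-16-le"):
    """Decode raw with each standard error handler and collect the outcomes.

    Hypothetical helper for illustration; assumes Python 3.
    """
    results = {}
    for handler in ("strict", "replace", "ignore"):
        try:
            results[handler] = raw.decode(encoding, handler)
        except UnicodeDecodeError as exc:
            # 'strict' refuses malformed input; keep the exception for display
            results[handler] = exc
    return results


# "ab" in UTF-16-LE followed by one dangling byte
truncated = b"a\x00b\x00c"
for handler, outcome in compare_error_handlers(truncated).items():
    print(handler, repr(outcome))
```

With this input, 'strict' raises UnicodeDecodeError, 'replace' yields "ab" followed by U+FFFD, and 'ignore' yields "ab". Each is a defensible reading of "deal with it", which is why the requirement needs to be spelled out.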
