Why does codecs.iterdecode() eat empty strings?


Question

Why do the following two decoding methods return different results?

>>> import codecs
>>>
>>> data = ['', '', 'a', '']
>>> list(codecs.iterdecode(data, 'utf-8'))
[u'a']
>>> [codecs.decode(i, 'utf-8') for i in data]
[u'', u'', u'a', u'']

Is this a bug or expected behavior? My Python version is 2.7.13.

Recommended answer

This is expected behavior. iterdecode takes an iterator over encoded chunks and returns an iterator over decoded chunks, but it doesn't promise a one-to-one correspondence between them. All it guarantees is that the concatenation of all output chunks is a valid decoding of the concatenation of all input chunks.
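
To make that guarantee concrete, here is a small sketch. It assumes Python 3, where encoded chunks are bytes objects (in the question's Python 2.7 a plain str plays that role), and the variable names are only illustrative.

```python
import codecs

# Encoded input chunks, including empty ones (bytes literals, assuming Python 3).
chunks = [b'', b'', b'a', b'']

# iterdecode yields only the non-empty decoded pieces.
decoded_chunks = list(codecs.iterdecode(chunks, 'utf-8'))

# The number of chunks differs from the input, but the concatenations agree.
joined_output = ''.join(decoded_chunks)
joined_input = b''.join(chunks).decode('utf-8')

print(decoded_chunks)
print(joined_output == joined_input)
```

The number of output chunks is an implementation detail; only the joined text is something to rely on.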

If you look at the source code, you'll see that it explicitly discards empty output chunks:

def iterdecode(iterator, encoding, errors='strict', **kwargs):
    """
    Decoding iterator.
    Decodes the input strings from the iterator using an IncrementalDecoder.
    errors and kwargs are passed through to the IncrementalDecoder
    constructor.
    """
    decoder = getincrementaldecoder(encoding)(errors, **kwargs)
    for input in iterator:
        output = decoder.decode(input)
        if output:
            yield output
    output = decoder.decode("", True)
    if output:
        yield output
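
If you do need one output per input chunk, one option is to drive the incremental decoder yourself and skip the emptiness check. The following is only a sketch, assuming Python 3; iterdecode_keep_empty is a hypothetical helper name, not part of the codecs module.

```python
import codecs

def iterdecode_keep_empty(chunks, encoding, errors='strict'):
    """Hypothetical variant of iterdecode: yield one output per input chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors)
    for chunk in chunks:
        # No "if output" filter here, so empty results are yielded too.
        yield decoder.decode(chunk)
    # Flush any buffered state; only yield it if something is left over.
    tail = decoder.decode(b'', True)
    if tail:
        yield tail

result = list(iterdecode_keep_empty([b'', b'', b'a', b''], 'utf-8'))
print(result)
```

Even with this variant, an output chunk is not necessarily the decoding of the matching input chunk: a character split across two inputs appears entirely in the later output.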


Be aware that the reason iterdecode exists, and the reason you wouldn't just call decode on each chunk yourself, is that decoding is stateful. The UTF-8 encoded form of a single character might be split across multiple chunks. Other codecs might have much stranger stateful behavior, such as a hypothetical byte sequence that inverts the case of all following characters until that byte sequence appears again.
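
The split-character case can be reproduced with a short sketch, again assuming Python 3 and bytes input. The character 'é' is encoded in UTF-8 as the two bytes 0xC3 0xA9, and here each byte arrives in its own chunk.

```python
import codecs

# 'é' is two bytes in UTF-8 (0xC3 0xA9); split them across two chunks.
chunks = [b'\xc3', b'\xa9']

# The incremental decoder buffers the incomplete first byte until the
# second byte arrives, so the character is decoded correctly.
stateful = list(codecs.iterdecode(chunks, 'utf-8'))

# Decoding each chunk on its own fails on the incomplete sequence.
try:
    stateless = [chunk.decode('utf-8') for chunk in chunks]
except UnicodeDecodeError:
    stateless = None

print(stateful)
print(stateless)
```

Here two input chunks produce a single output chunk, which is another case where the one-to-one correspondence does not hold.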
