Validate that a stream of bytes is valid UTF-8 (or other encoding) without copy

Question

This is perhaps a micro-optimization, but I would like to check that a stream of bytes is valid UTF-8 as it passes through my application, without keeping the resulting decoded code points. In other words, if I were to call large_string.decode('utf-8'), then assuming the decoding succeeds I have no desire to keep the unicode string it returns, and would prefer not to waste memory on it.

There are various ways I could do this, for example read a few bytes at a time, attempt to decode(), then append more bytes until decode() succeeds (or I've exhausted the maximum number of bytes for a single character in the encoding). But it seems to me it should be possible to use the existing decoder in a way that simply throws away the decoded unicode characters, rather than rolling my own. Nothing immediately comes to mind after scouring the stdlib docs.

Recommended answer

You can use the incremental decoder provided by the codecs module:

import codecs

utf8_decoder = codecs.getincrementaldecoder('utf8')()

This is an IncrementalDecoder() instance. You can then feed this decoder data in order and validate the stream:

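# for each partial chunk of data, in stream order
# (`chunks` is a placeholder name for your iterable of bytes objects):
for chunk in chunks:
    try:
        utf8_decoder.decode(chunk)
    except UnicodeDecodeError:
        # invalid data; handle it here (re-raised in this sketch)
        raise
# after the last chunk, flush so a truncated trailing sequence is also reported
utf8_decoder.decode(b'', final=True)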

The decoder returns the data decoded so far, minus any partial multi-byte sequence at the end of the chunk, which is kept as state for the next call to decode(). Those small strings are cheap to create and discard; you never build one large string.
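
To make the behaviour at chunk boundaries concrete, here is a minimal sketch that wraps the incremental decoder in a reusable function. The name is_valid_stream and its parameters are made up for illustration and are not part of the codecs module; the sketch assumes Python 3 and that the chosen encoding provides an incremental decoder.

```python
import codecs


def is_valid_stream(chunks, encoding="utf-8"):
    """Return True if the concatenation of `chunks` decodes cleanly.

    `chunks` is any iterable of bytes objects, in stream order.
    (Hypothetical helper, not a stdlib function.)
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        for chunk in chunks:
            # The returned str is discarded immediately; only the decoder's
            # small internal state survives between calls.
            decoder.decode(chunk)
        # final=True makes the decoder raise if the stream ended in the
        # middle of a multi-byte sequence.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


data = "naïve café €".encode("utf-8")
# Split into 1-byte chunks so multi-byte characters straddle chunk boundaries.
print(is_valid_stream(data[i:i + 1] for i in range(len(data))))  # True
print(is_valid_stream([b"abc\xff"]))  # False: 0xFF never appears in UTF-8
print(is_valid_stream([b"caf\xc3"]))  # False: truncated 2-byte sequence
```

The final decode(b"", final=True) call matters: without it, a stream that stops halfway through a character would pass, because the dangling bytes would just sit in the decoder's buffer.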

You can't start the loop above partway through a stream, because UTF-8 uses a variable number of bytes per character; a chunk taken from the middle of the stream is liable to have invalid data at the start.

If you can't validate from the start, then your first chunk may start with up to three continuation bytes. You could just remove those first:

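first_chunk = b'....'
for _ in range(3):
    # continuation bytes look like 10xxxxxx (Python 3: indexing bytes gives an int)
    if first_chunk and first_chunk[0] & 0xc0 == 0x80:
        # remove continuation byte
        first_chunk = first_chunk[1:]
    else:
        # reached a lead byte (or the chunk is empty); stop stripping
        break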
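
As a rough illustration of why the stripping step is needed, the sketch below cuts a chunk in the middle of a character, shows the decoder rejecting it, and then decodes the stripped version. The helper name strip_leading_continuation and its limit parameter are assumptions made for this example; it assumes Python 3, where indexing a bytes object yields an int.

```python
import codecs


def strip_leading_continuation(chunk, limit=3):
    """Drop up to `limit` leading UTF-8 continuation bytes (10xxxxxx).

    Hypothetical helper for illustration only.
    """
    skip = 0
    while skip < limit and skip < len(chunk) and chunk[skip] & 0xC0 == 0x80:
        skip += 1
    return chunk[skip:]


text = "€uro".encode("utf-8")  # b'\xe2\x82\xacuro'
mid_stream = text[1:]          # starts inside the 3-byte euro sign

decoder = codecs.getincrementaldecoder("utf-8")()
try:
    decoder.decode(mid_stream)
except UnicodeDecodeError as exc:
    print("unstripped chunk rejected:", exc.reason)

decoder.reset()  # clear any state before reusing the decoder
print(decoder.decode(strip_leading_continuation(mid_stream), final=True))  # uro
```

Note that stripping discards the tail of a character whose head you never saw, so this only tells you that the data from the first complete character onward is valid.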

Now, UTF-8 is structured enough that you could also validate the stream entirely in Python code using more such binary tests, but you are simply not going to match the speed of the built-in decoder.
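
For the curious, this is roughly what such binary tests could look like. It is a sketch only, not code from the answer: the function name is_valid_utf8 is made up, it works on one complete bytes object rather than a stream, it assumes Python 3, and the byte ranges follow the well-formed sequence table in the Unicode Standard. Expect it to be far slower than the built-in decoder.

```python
def is_valid_utf8(data):
    """Pure-Python UTF-8 check on a complete bytes object (illustrative, slow)."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                  # 0xxxxxxx: ASCII
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:         # 2-byte sequence (0xC0/0xC1 are overlong)
            need, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= lead <= 0xEF:       # 3-byte sequence
            need = 2
            lo = 0xA0 if lead == 0xE0 else 0x80  # reject overlong forms
            hi = 0x9F if lead == 0xED else 0xBF  # reject surrogates
        elif 0xF0 <= lead <= 0xF4:       # 4-byte sequence
            need = 3
            lo = 0x90 if lead == 0xF0 else 0x80  # reject overlong forms
            hi = 0x8F if lead == 0xF4 else 0xBF  # reject values above U+10FFFF
        else:
            return False                 # stray continuation or invalid lead byte
        if i + need >= n:
            return False                 # sequence is cut off at the end
        if not lo <= data[i + 1] <= hi:  # first continuation has a tighter range
            return False
        for j in range(i + 2, i + need + 1):
            if data[j] & 0xC0 != 0x80:   # remaining bytes must be 10xxxxxx
                return False
        i += need + 1
    return True


for sample in (b"abc", "€".encode("utf-8"), b"\xc0\x80", b"\xed\xa0\x80", b"\xe2\x82"):
    print(sample, is_valid_utf8(sample))
```

Most of the branches exist to reject overlong encodings, surrogates, and out-of-range code points, which is exactly the bookkeeping the built-in decoder already does in C.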
