Ungzipping chunks of bytes from S3 using iter_chunks()


Problem description

I am encountering issues ungzipping chunks of bytes that I am reading from S3 using the iter_chunks() method from boto3. The strategy of ungzipping the file chunk-by-chunk originates from this issue.

The code is as follows:

import zlib

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in app.s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19):
    data = dec.decompress(chunk)
    print(len(chunk), len(data))

# 524288 65505
# 524288 0
# 524288 0
# ...

This code initially prints a value of 65505, then 0 for every subsequent iteration. My understanding is that this code should ungzip each compressed chunk and then print the length of the uncompressed version.

Is there something I'm missing?

Solution

It seems like your input file is block gzip (bgzip, http://www.htslib.org/doc/bgzip.html), because a 65k block of data was decoded.

GZip files can be concatenated together (see https://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage), and Block GZip uses this to concatenate blocks of the same file, so that, with an associated index, only the specific block containing the information of interest has to be decoded.
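
You can reproduce this member-boundary behaviour with plain concatenated gzip members (a minimal sketch using only the standard library; the payload strings are made up): a single decompressobj stops at the end of the first member and parks the remaining bytes in unused_data.

import gzip
import zlib

# two gzip members concatenated into one byte string
blob = gzip.compress(b"first member") + gzip.compress(b"second member")

dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
data = dec.decompress(blob)
print(data)                  # b'first member' -- decoding stopped here
print(len(dec.unused_data))  # bytes of the second member, left untouched

Once the first member ends, the decompressor is exhausted, so feeding it further bytes returns b'' — exactly the zero lengths seen in the question.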

So, to stream-decode a block gzip file, you need to use the leftover data from the end of one member to start decompressing the next. For example:

# source is a block gzip (bgzip) file, see http://www.htslib.org/doc/bgzip.html
import zlib

# raw is an iterable of compressed chunks,
# e.g. the iter_chunks() iterator from the question
dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
for chunk in raw:
    # decompress this chunk of data
    data = dec.decompress(chunk)
    # bgzip is a concatenation of gzip members: if this chunk contains
    # bytes beyond the end of the current member, they must be fed to
    # a fresh decompressor
    while len(dec.unused_data):
        # end of one member reached
        leftovers = dec.unused_data
        # create a new decompressor for the next member
        dec = zlib.decompressobj(32 + zlib.MAX_WBITS)
        # decompress the leftovers
        data = data + dec.decompress(leftovers)
    # TODO handle data
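
Wired up to the S3 stream from the question, the loop above could be fed like this (a sketch; the bucket, key, and client setup are hypothetical placeholders, and the chunk size is the question's):

import boto3

s3_client = boto3.client("s3")       # assumed client setup
bucket, key = "my-bucket", "my-key"  # hypothetical placeholder values
# same 512 KiB chunk size as in the question
raw = s3_client.get_object(Bucket=bucket, Key=key)["Body"].iter_chunks(2 ** 19)

Each chunk then flows through the decompressor loop, with data holding the uncompressed bytes recovered from that chunk, across member boundaries.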
