The size parameter for gzip.open().read()


Question

When working with the gzip library in Python, I often come across code that uses the .read() function in a pattern like this:

with gzip.open(filename) as bytestream:
    bytestream.read(16) 
    buf = bytestream.read(
        IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS
    )
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)

While I'm familiar with the context manager pattern, I struggle to grasp what the first line of code inside the with block is actually doing.

Here is the documentation for the read() function:

Read at most n characters from stream.

Read from underlying buffer until we have n characters or we hit EOF. If n is negative or omitted, read until EOF.

If that is the case, the role of the first line, bytestream.read(16), must be to read and thereby skip the first 16 characters, presumably because they act as metadata or a header. However, when I have some images, how would I know to use 16 as the argument to read rather than, say, 8, 32, or 64?

I recall coming across otherwise identical code many times, except that the author used bytestream.read(8) instead of bytestream.read(16), or some other value. Digging into the file character by character shows no discernible pattern that would determine the length of the header.

In other words, how does one determine the argument to pass to read? Or, how does one know the length of the header in a gzip-compressed file?

My guess was that it has something to do with the bytes, but after searching the documentation and online references I can't confirm that.

My hypothesis, after countless hours of troubleshooting, is that the first 16 characters represent some sort of header or metadata. So the first line in that code skips those 16 characters, and the next call stores the rest in a variable named buf. However, digging into the data, I found no way to determine why or how the value 16 was chosen. I have read the bytes in character by character, and also tried reading them and casting them to np.float, but there is no discernible pattern suggesting that the metadata ends at the 16th character and the actual data begins at the 17th.

The following code reads the data from this website and extracts the first 30 characters. Notice that it is not evident where the header "ends" (apparently at the 16th character, after the second appearance of \x1c) and where the data begins:

import gzip
import numpy as np

train_data_filename = 'data_input/train-images-idx3-ubyte.gz'
IMAGE_SIZE = 28
NUM_CHANNELS = 1

def extract_data(filename, num_images):
    with gzip.open(filename) as bytestream:
        first30 = bytestream.read(30)
        return first30

first30= extract_data(train_data_filename, 10)
print(first30)
# returns: b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

If we modify the code to cast the bytes to np.float32, so that all characters are numeric (float), there is again no apparent pattern that distinguishes where the header or metadata ends and where the data begins.

Any reference or advice would be much appreciated!

Recommended answer

From gzip's perspective, everything it's returning to you is data. There is no metadata or gzip-specific header content prepended to that data stream, so there's no need for any kind of algorithm to figure out how much content gzip is prepending to that stream: the number of bytes it prepends is zero.
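As a rough illustration of that point, the sketch below compresses a made-up payload in memory and reads it back through gzip.open(). The payload bytes and variable names are assumptions chosen for the example, not taken from the MNIST file. The gzip container does have its own header in the compressed bytes, but none of it shows up in what read() returns.

```python
import gzip
import io

# Made-up 16-byte payload standing in for the start of a data file.
payload = b"\x00\x00\x08\x03" + bytes(range(12))

# Compress in memory; the compressed bytes begin with gzip's own magic number.
compressed = gzip.compress(payload)
print(compressed[:2] == b"\x1f\x8b")

# Reading through gzip.open() yields exactly the original payload,
# with nothing added in front of it.
with gzip.open(io.BytesIO(compressed)) as stream:
    first4 = stream.read(4)
    rest = stream.read()

print(first4 == payload[:4])
print(first4 + rest == payload)
```

So any header that needs skipping belongs to the format of the data inside the archive, not to gzip.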

Scroll down to the bottom of the page you linked; there's a heading titled FILE FORMATS FOR THE MNIST DATABASE.

That format specification tells you exactly what the format is, and thus how many bytes are used for each header. Specifically, the first four items in each file are described as follows:

0000     32 bit integer  0x00000803(2051) magic number 
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns 
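One way to check this against the bytes printed in the question is to decode the first 16 of them with the standard struct module. This is only a sketch: it assumes the four fields are big-endian unsigned 32-bit integers, which is how the MNIST format page describes them (most significant byte first).

```python
import struct

# First 16 bytes of the output shown in the question.
header = b'\x00\x00\x08\x03\x00\x00\xea`\x00\x00\x00\x1c\x00\x00\x00\x1c'

# ">" means big-endian, "I" means unsigned 32-bit integer.
# Four of them occupy 4 * 4 = 16 bytes.
magic, num_images, rows, cols = struct.unpack(">IIII", header)

print(magic, num_images, rows, cols)  # 2051 60000 28 28
print(struct.calcsize(">IIII"))       # 16
```

The decoded values match the table, which is why the header looks patternless when viewed one character at a time: each field spans four bytes.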

Thus, if you want to skip all four of those items, you would take 16 bytes off the top.
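Instead of discarding the header, the same 16 bytes can be used to size the subsequent reads. The sketch below is a hypothetical helper (read_idx_images is not part of gzip or of any MNIST tooling) and it runs on a tiny synthetic file built in memory, so the image count and dimensions are invented for the example. The label files described on the same page have only two header fields (magic number and item count), which would account for the read(8) variant mentioned in the question.

```python
import gzip
import io
import struct


def read_idx_images(stream):
    """Read an image file in the layout shown above from a decompressed stream.

    Hypothetical helper: returns (rows, cols, list of per-image bytes).
    """
    magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
    if magic != 2051:
        raise ValueError(f"unexpected magic number: {magic}")
    size = rows * cols  # one byte per pixel
    images = [stream.read(size) for _ in range(count)]
    return rows, cols, images


# Synthetic stand-in for a real file: 2 images of 2 x 3 pixels each.
raw = struct.pack(">IIII", 2051, 2, 2, 3) + bytes(range(12))

with gzip.open(io.BytesIO(gzip.compress(raw))) as stream:
    rows, cols, images = read_idx_images(stream)

print(rows, cols, len(images))  # 2 3 2
print(list(images[1]))          # [6, 7, 8, 9, 10, 11]
```

Reading the counts from the header avoids hard-coding values such as IMAGE_SIZE or num_images, at the cost of trusting the file to be well formed.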
