Reading-in a binary JPEG-Header (in Python)


Question

I would like to read in a JPEG header and analyze it. According to Wikipedia, the header consists of a sequence of markers. Each marker starts with FF xx, where xx is a specific marker ID.

So my idea was to simply read the image in binary mode and search the byte stream for the corresponding byte combinations. This should let me split the header into its marker fields.

For instance, this is what I get when I read the first 20 bytes of an image:

binary_data = open('picture.jpg','rb').read(20)
print(binary_data)

b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'

My questions now are:

1) Why does Python not return nice chunks of 2 bytes (in hex format)? I would expect something like this: b'\xff \xd8 \xff \xe1 \x-' ... and so on. Some blocks delimited by '\x' are much longer than 2 bytes.

2) Why are there symbols like -, M, * in the returned string? Those are not characters of the hex representation I expect from a byte string (only 0-9 and a-f, I think).

Both observations hinder me in writing a simple parser. So ultimately my question comes down to: How do I properly read in and parse a JPEG header in Python?

Recommended answer

You seem overly worried about how your binary data is represented on your console. Don't worry about that.

The default built-in string representation that print(..) applies to a bytes object is just "printable ASCII characters as such (with a few exceptions), all others as an escaped hex sequence". The exceptions are semi-special characters such as \, ", and ', which could mess up the string representation. But this alternative representation does not change the values in any way!

>>> a = bytes([1,2,4,92,34,39])
>>> a
b'\x01\x02\x04\\"\''
>>> a[0]
1

See how the entire object is printed 'as if' it were a string, but its individual elements are still perfectly normal bytes?
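The same check can be applied to the bytes from the question, and it answers both points at once: the `-` after `\xe1` is simply the byte 0x2d shown as its ASCII character. A small sketch, where the name `data` is only for illustration and the literal is copied from the question's output:

```python
# The first 20 bytes from the question, written as a bytes literal.
data = b'\xff\xd8\xff\xe1-\xfcExif\x00\x00MM\x00*\x00\x00\x00\x08'

# Each element is an integer in 0..255, however it happens to be displayed.
first_six = list(data[:6])
print(first_six)                    # [255, 216, 255, 225, 45, 252]
print([hex(b) for b in first_six])  # ['0xff', '0xd8', '0xff', '0xe1', '0x2d', '0xfc']
print(len(data))                    # 20
```

So `\xe1-\xfc` is three separate bytes (0xe1, 0x2d, 0xfc), not one long block.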

If you have a bytes object and you don't like the appearance of this default, you can write your own formatting. But, for clarity, this still has nothing to do with parsing a file.

>>> binary_data = open('iaijiedc.jpg','rb').read(20)
>>> binary_data
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
>>> ''.join(['%02x%02x ' % (binary_data[2*i],binary_data[2*i+1]) for i in range(len(binary_data)>>1)])
'ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000 '
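On Python 3.8 or newer, `bytes.hex()` accepts a separator and a group size, which gives much the same display with less code. A minimal sketch, using a literal copy of the 20 bytes shown above in place of the file read:

```python
# Literal copy of the 20 bytes printed above (stands in for the file read).
binary_data = b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'

# sep=' ' with bytes_per_sep=2 groups the output two bytes at a time.
# Groups are counted from the right end, which makes no difference
# for an even number of bytes. Requires Python 3.8+.
grouped = binary_data.hex(' ', 2)
print(grouped)  # ffd8 ffe0 0010 4a46 4946 0001 0201 0048 0048 0000
```

Unlike the join above, this leaves no trailing space.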

Why does Python not return nice chunks of 2 bytes (in hex format)?

Because you don't ask it to. You are asking for a sequence of bytes, and that's what you get. If you want two-byte chunks, transform the data after reading.

The code above only prints the data; to create a new list that contains 2-byte words, loop over it and convert each pair of bytes, or use unpack from the struct module (there are actually several ways):

>>> from struct import unpack
>>> wd = [unpack('>H', binary_data[x:x+2])[0] for x in range(0,len(binary_data),2)]
>>> wd
[65496, 65504, 16, 19014, 18758, 1, 513, 72, 72, 0]
>>> [hex(x) for x in wd]
['0xffd8', '0xffe0', '0x10', '0x4a46', '0x4946', '0x1', '0x201', '0x48', '0x48', '0x0']

I'm using the big-endian specifier > and the unsigned short H in unpack because JPEG stores its 2-byte values, such as markers and segment lengths, most significant byte first. Check the struct documentation if you want to deviate from this.
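For the actual parsing, fixed 2-byte chunks are not enough, because segments have variable length: most markers are followed by a 2-byte big-endian length that counts itself plus the payload. Below is a minimal sketch of walking the header that way, assuming well-formed input. The names `iter_jpeg_segments` and `STANDALONE` are invented here for illustration and are not part of any library, and the sample stream is synthetic.

```python
from struct import unpack

# Markers that stand alone and carry no length field:
# TEM (0x01), RST0-RST7 (0xD0-0xD7), SOI (0xD8), EOI (0xD9).
STANDALONE = {0x01, 0xD8, 0xD9} | set(range(0xD0, 0xD8))

def iter_jpeg_segments(data):
    """Yield (marker_id, payload) for each segment up to and including SOS.

    Hypothetical helper: no error recovery, and it stops at Start Of Scan
    (0xDA), after which the entropy-coded image data begins.
    """
    pos = 0
    while pos + 2 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError('expected 0xFF at offset %d' % pos)
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1            # fill byte before the real marker
            continue
        pos += 2
        if marker in STANDALONE:
            yield marker, b''
            if marker == 0xD9:  # End Of Image
                return
            continue
        if pos + 2 > len(data):
            raise ValueError('truncated segment at offset %d' % pos)
        # The length counts its own two bytes but not the marker.
        (length,) = unpack('>H', data[pos:pos + 2])
        yield marker, data[pos + 2:pos + length]
        if marker == 0xDA:
            return
        pos += length

# Synthetic stream: SOI, the JFIF APP0 segment shown earlier,
# and an (unrealistically) empty SOS segment.
sample = (b'\xff\xd8'
          b'\xff\xe0\x00\x10JFIF\x00\x01\x02\x01\x00H\x00H\x00\x00'
          b'\xff\xda\x00\x02')
segments = list(iter_jpeg_segments(sample))
for marker, payload in segments:
    print('FF%02X' % marker, len(payload))
```

With a real file, the whole header would need to be read rather than only the first 20 bytes, since a segment such as Exif APP1 can be tens of kilobytes long.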
