带有二进制数据的python文件I/O [英] python file I/O with binary data

查看:92
本文介绍了带有二进制数据的python文件I/O的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在从mp3数据中提取jpeg类型的比特,实际上这将是唱片艺术. 我曾考虑过使用名为mutagen的库,但出于某些实践目的,我想尝试一下.

I'm extracting jpeg type bits from mp3 data actually it will be album arts. I thought about using library called mutagen, but I'd like to try with bits for some practice purpose.

import os
import sys
import re

f = open(sys.argv[1], "rb")
#sys.argv[1] gets mp3 file name ex) test1.mp3

saver = ""
for value in f:
    for i in value:
        hexval = hex(ord(i))[2:]
        if (ord(i) == 0):
            saver += "00" #to match with hex form
        else:
            saver += hexval


header = "ffd8"
tail = "ffd9"

这部分代码是将mp3转换为位形式,然后将其转换为十六进制 并找到以"ffd8"开头并以"ffd9"结尾的jpeg预告片

this part of code is to get mp3 as bit form, and then transform it into hex and find jpeg trailers which starts as "ffd8" and ends with "ffd9"

frontmatch = re.search(header,saver)
endmatch = re.search(tail, saver)
startIndex = frontmatch.start()
endIndex = endmatch.end()

jpgcontents = saver[startIndex:endIndex]
scale = 16 # equals to hexadecimal
numbits = len(jpgcontents) * 4 #log2(scale)
bitcontents = bin(int(jpgcontents, scale))[2:].zfill(numbits)

在这里,我得到了标题和尾部之间的位,并将其转换为 二进制形式.应该是mp3文件的jpg部分.

and here, I get the bits between the header and tail and transform it into binary form. Which supposed to be the jpg part of the mp3 files.

txtfile = open(sys.argv[1] + "_tr.jpg", "w")
txtfile.write(bitcontents)

,然后将bin写入类型为jpg的新文件中. 对不起,我将其命名为txtfile.

and I wrote the bin to the new file with writing type as jpg. sorry for my wrong naming as txtfile.

但是这些代码给出的错误是

But these codes gave the error which is

Error interpreting JPEG image file
(Not a JPEG file: starts with 0x31 0x31)

我不确定我提取的位是否错误或正在写入文件 步骤是错误的.否则代码中可能还有其他问题.

I'm not sure whether the bits I extracted are wrong or writing to the file step is wrong. Or there might be other problem in code.

我正在使用python 2.6的linux版本.有什么问题吗 只是将str类型的bin数据写为JPG?

I'm working in linux version with python 2.6. Is there anything wrong with just writing str type of bin data as JPG?

推荐答案

您正在创建ASCII零和1的字符串,即\x30\x31,但是JPEG文件必须是正确的二进制数据.因此,在文件应具有单个字节(例如)\xd8的地方,您应该具有以下八个字节:11011000\x31\x31\x30\x31\x31\x30\x30\x30.

You are creating a string of ASCII zeroes and ones, i.e. \x30 and \x31, but the JPEG file needs to be proper binary data. So where your file should have a single byte of (for example) \xd8 you instead have these eight bytes: 11011000, or \x31\x31\x30\x31\x31\x30\x30\x30.

您不需要做所有麻烦的转换工作.您可以直接搜索所需的字节模式,并使用\x十六进制转义序列将其写入.而且您甚至不需要正则表达式:简单的字符串.index.find方法可以轻松,快速地做到这一点.

You don't need to do all that messy conversion stuff. You can just search directly for the desired byte patterns, writing them using \x hex escape sequences. And you don't even need regex: the simple string .index or .find methods can do this easily and quickly.

with open(fname, 'rb') as f:
    data = f.read()

header = "\xff\xd8"
tail = "\xff\xd9"

try:
    start = data.index(header)
    end = data.index(tail, start) + 2
except ValueError:
    print "Can't find JPEG data!"
    exit()

print 'Start: %d End: %d Size: %d' % (start, end, end - start)

with open(fname + "_tr.jpg", 'wb') as f:
    f.write(data[start:end])

(在Python 2.6.6上测试)

(Tested on Python 2.6.6)

但是,像这样提取嵌入的JPEG数据并不是万无一失的,因为MP3声音数据中可能存在那些首部和尾部字节序列.

However, extracting embedded JPEG data like this isn't foolproof, since it's possible that those header and tail byte sequences exist in the MP3 sound data.

FWIW,一种将二进制数据转换为十六进制字符串并返回的更简单方法是使用 binascii 模块.

FWIW, a simpler way to translate binary data to hex strings and back is to use hexlify and unhexlify from the binascii module.

以下是进行这些转换的一些示例,无论是否使用binascii函数.

Here are some examples of doing these transformations, both with and without the binascii functions.

from binascii import hexlify, unhexlify

#Create a string of all possible byte values
allbytes = ''.join([chr(i) for i in xrange(256)])
print 'allbytes'
print repr(allbytes)

print '\nhex list'
print [hex(ord(v))[2:].zfill(2) for v in allbytes]
hexstr = hexlify(allbytes)

print '\nhex string'
print hexstr
newbytes = ''.join([chr(int(hexstr[i:i+2], 16)) for i in xrange(0, len(hexstr), 2)])

print '\nNew bytes'
print repr(newbytes)

print '\nUsing unhexlify'
print repr(unhexlify(hexstr))    

输出

allbytes
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

hex list
['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '0a', '0b', '0c', '0d', '0e', '0f', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '1a', '1b', '1c', '1d', '1e', '1f', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '2a', '2b', '2c', '2d', '2e', '2f', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '3a', '3b', '3c', '3d', '3e', '3f', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '4a', '4b', '4c', '4d', '4e', '4f', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '5a', '5b', '5c', '5d', '5e', '5f', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '6a', '6b', '6c', '6d', '6e', '6f', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '7a', '7b', '7c', '7d', '7e', '7f', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '8a', '8b', '8c', '8d', '8e', '8f', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '9a', '9b', '9c', '9d', '9e', '9f', 'a0', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'aa', 'ab', 'ac', 'ad', 'ae', 'af', 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'b8', 'b9', 'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'ca', 'cb', 'cc', 'cd', 'ce', 'cf', 'd0', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'da', 'db', 'dc', 'dd', 'de', 'df', 'e0', 'e1', 'e2', 'e3', 'e4', 'e5', 'e6', 'e7', 'e8', 'e9', 'ea', 'eb', 'ec', 'ed', 'ee', 'ef', 'f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'fa', 'fb', 'fc', 'fd', 'fe', 'ff']

hex string
000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f303132333435363738393a3b3c3d3e3f404142434445464748494a4b4c4d4e4f505152535455565758595a5b5c5d5e5f606162636465666768696a6b6c6d6e6f707172737475767778797a7b7c7d7e7f808182838485868788898a8b8c8d8e8f909192939495969798999a9b9c9d9e9fa0a1a2a3a4a5a6a7a8a9aaabacadaeafb0b1b2b3b4b5b6b7b8b9babbbcbdbebfc0c1c2c3c4c5c6c7c8c9cacbcccdcecfd0d1d2d3d4d5d6d7d8d9dadbdcdddedfe0e1e2e3e4e5e6e7e8e9eaebecedeeeff0f1f2f3f4f5f6f7f8f9fafbfcfdfeff

New bytes
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

Using unhexlify
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'

请注意,此代码需要进行一些修改才能在Python 3上运行(除了将print语句转换为print函数调用外),因为纯Python 3字符串是Unicode字符串,而不是字节字符串.

Note that this code needs some modifications to run on Python 3 (apart from converting the print statements to print function calls) because plain Python 3 strings are Unicode strings, not byte strings.

这篇关于带有二进制数据的python文件I/O的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆