Python: can someone guess the type of a file only by its base64 encoding?


Question

Suppose I have the following:

image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""

This is just a dot image (from https://en.wikipedia.org/wiki/Data_URI_scheme). But I do not know whether it is an image, text, or something else. Is it possible to tell what it is from the encoded string alone? I tried this in Python, but it is also a general question, so any insight on either front is welcome.

Recommended answer

You can't, at least not without decoding, because the bytes that help identify the file type are spread across the base64 characters, which do not align with whole bytes. Each character encodes 6 bits, so every 4 characters encode 3 bytes.
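As a rough illustration of that bit arithmetic, the sketch below encodes two made-up 3-byte values that share their first two bytes. The variable names are only for this example. Each value comes out as exactly 4 base64 characters, and changing just the third byte also changes the third character, which shows that characters do not map onto whole bytes:

```python
import base64

# Two hypothetical 3-byte samples that differ only in the last byte.
first = bytes([0xFF, 0xD8, 0xFF])
second = bytes([0xFF, 0xD8, 0x00])

encoded_first = base64.b64encode(first).decode("ascii")
encoded_second = base64.b64encode(second).decode("ascii")

# 3 bytes * 8 bits = 24 bits = 4 characters * 6 bits
print(len(first) * 8, len(encoded_first) * 6)

# The third character carries bits from both the second and third byte,
# so it changes even though the first two bytes are identical.
print(encoded_first)   # /9j/
print(encoded_second)  # /9gA
```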

Identifying a file type requires access to those bytes in different block sizes. A JPEG image, for example, can be identified by the bytes FF D8 at the start of the file (and FF D9 at the end), but that is only two bytes; the third byte that follows is also encoded as part of the same 4-character block.

What you can do is decode just enough of the base64 string to do your file type fingerprinting. So you can decode the first 4 characters to get 3 bytes, and then use the first two of those bytes to see whether the object is a JPEG image. A large number of file formats can be identified from just the first or last series of bytes (a PNG image can be identified by its first 8 bytes, a GIF by its first 6, and so on). Decoding just those bytes from the base64 string is trivial.
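A minimal sketch of that partial decoding follows. The `sniff_base64` helper and the `SIGNATURES` table are hypothetical names invented for this example, the table is deliberately incomplete, and the input is assumed to be valid base64 with no embedded whitespace:

```python
import base64

# A few well-known leading signatures ("magic bytes"); deliberately incomplete.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"\xff\xd8", "jpeg"),
]

def sniff_base64(b64_text):
    """Guess a file type from the start of a base64 string (hypothetical helper)."""
    # 12 base64 characters decode to 9 bytes, enough for the 8-byte PNG signature.
    head = base64.b64decode(b64_text[:12])
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None

# The first 16 characters of the sample from the question.
print(sniff_base64("iVBORw0KGgoAAAAN"))  # png
```

Only the first 12 characters are ever decoded, so the cost stays constant no matter how large the encoded file is.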

Your sample is a PNG image; you can test for image types using the imghdr module:

>>> import base64
>>> import imghdr  # standard library up to Python 3.12; removed in 3.13
>>> image_data = """iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="""
>>> sample = base64.b64decode(image_data[:44])  # 33 bytes / 3 * 4 = 44 base64 chars
>>> for tf in imghdr.tests:  # each test takes (header_bytes, file_object)
...     res = tf(sample, None)
...     if res:
...         break
...
>>> print(res)
png

I only used the first 44 base64 characters, which decode to 33 bytes, to echo what the imghdr.what() function reads from a file you pass it (it reads 32 bytes, but that number is not divisible by 3).
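The rounding from 32 bytes up to 44 characters can be sketched as follows. The `base64_chars_for` helper is a hypothetical name for this example, and the test payload is made up:

```python
import base64

def base64_chars_for(n_bytes):
    """Shortest whole-block base64 prefix covering n_bytes decoded bytes."""
    blocks = -(-n_bytes // 3)  # ceiling division: 3 bytes per block
    return blocks * 4          # 4 characters per block

n_chars = base64_chars_for(32)
print(n_chars)  # 44

# Made-up payload to confirm that a 44-character prefix decodes to 33 bytes.
encoded = base64.b64encode(bytes(range(100))).decode("ascii")
prefix = base64.b64decode(encoded[:n_chars])
print(len(prefix))  # 33
```

Slicing on a multiple of 4 characters keeps the prefix a valid base64 string by itself, so no padding has to be added before decoding.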

There is an equivalent sndhdr module for sound files (like imghdr, it was removed from the standard library in Python 3.13), and there is also the python-magic project, which lets you pass in a number of bytes to determine a file type.
