Python - mechanism to identify compressed file type and uncompress


Question

A compressed file can be classified into the logical groups below:
a. The operating system you are working on (*nix, Windows, etc.).
b. The compression algorithm or format (e.g. .zip, .Z, .bz2, .rar, .gzip), at least from a standard list of commonly used compressed file types.
c. Then we have the tarball mechanism, where I suppose there is no compression; it acts more like a concatenation.

Now, if we start addressing the above set of compressed files:
a. Option (a) would be taken care of by Python, since it is a platform-independent language.
b. Options (b) and (c) seem to be a problem.

What I need

How do I identify the file type (compression type) and then uncompress the file?

Something like:

fileType = getFileType(fileName)  
switch(fileType):  
case .rar:  unrar....
case .zip:  unzip....

etc  






So the fundamental question is: how do we identify the compression algorithm from the file itself (assuming the extension is missing or incorrect)? Is there a specific way to do it in Python?

Recommended answer

This page has a list of "magic" file signatures. Grab the ones you need and put them in a dict like the one below. Then we need a function that matches the dict keys against the start of the file. I've written a suggestion, though it could be optimized by preprocessing magic_dict into, e.g., one giant compiled regexp.

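# Magic signatures as bytes, because the file is read in binary mode.
magic_dict = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip"
    }

# Read only as many bytes as the longest signature needs.
max_len = max(len(x) for x in magic_dict)

def file_type(filename):
    # Binary mode: compare raw bytes, with no decoding or newline translation.
    with open(filename, "rb") as f:
        file_start = f.read(max_len)
    for magic, filetype in magic_dict.items():
        if file_start.startswith(magic):
            return filetype
    return "no match"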
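
The answer mentions preprocessing the signatures into one compiled regexp. A minimal sketch of that idea might look like the following. The names MAGIC, MAGIC_RE, file_type_from_bytes and file_type_re are made up for this illustration and are not part of the original answer; the signatures are written as bytes so they can be matched against data read in binary mode.

```python
import re

# Same three signatures as in the answer, as bytes (names here are illustrative).
MAGIC = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip",
}

# Try longer signatures first so that, if one signature were a prefix of
# another, the more specific one would win.
_SIGNATURES = sorted(MAGIC, key=len, reverse=True)

# One alternation of escaped signatures; re.match anchors it at the start.
MAGIC_RE = re.compile(b"|".join(re.escape(sig) for sig in _SIGNATURES))
MAX_LEN = len(_SIGNATURES[0])


def file_type_from_bytes(head):
    """Return the type whose signature begins `head`, or "no match"."""
    m = MAGIC_RE.match(head)
    return MAGIC[m.group(0)] if m else "no match"


def file_type_re(filename):
    """Read just enough bytes from `filename` and classify them."""
    with open(filename, "rb") as f:
        return file_type_from_bytes(f.read(MAX_LEN))
```

With only three signatures the plain loop is just as fast in practice; the regexp form would mainly pay off with a much larger signature table.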

This solution should be cross-platform and of course does not depend on the file name extension, but it may give false positives for files with random content that just happen to start with one of the magic byte sequences.
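
The question also asks how to uncompress once the type is known. One possible way to do the dispatch with standard-library modules only is sketched below. The helper names detect and uncompress, the ".out" output suffix and the dest_dir parameter are assumptions made for this example. It covers just the three types from the dict above; .rar and .Z have no decompressor in the standard library and are left out.

```python
import bz2
import gzip
import os
import shutil
import zipfile

# Signatures from the answer, as bytes (illustrative copy).
MAGIC = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip",
}


def detect(filename):
    """Return "gz", "bz2", "zip" or "no match" based on the leading bytes."""
    with open(filename, "rb") as f:
        head = f.read(max(len(m) for m in MAGIC))
    for magic, kind in MAGIC.items():
        if head.startswith(magic):
            return kind
    return "no match"


def uncompress(filename, dest_dir):
    """Uncompress `filename` into `dest_dir` and return the detected type.

    Raises ValueError when no known signature matches.
    """
    kind = detect(filename)
    if kind == "no match":
        raise ValueError("unrecognized compression type: %r" % filename)
    os.makedirs(dest_dir, exist_ok=True)
    if kind == "zip":
        # A zip archive can hold many members; extract all of them.
        with zipfile.ZipFile(filename) as archive:
            archive.extractall(dest_dir)
    else:
        # gz and bz2 wrap a single stream; the ".out" name is an arbitrary
        # choice for this sketch.
        opener = gzip.open if kind == "gz" else bz2.open
        target = os.path.join(dest_dir, os.path.basename(filename) + ".out")
        with opener(filename, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return kind
```

If a file is a false positive, that is, it starts with the right bytes but is not really an archive, the decompressor will typically fail with an exception such as zipfile.BadZipFile or OSError, so callers may want to catch those. For tarballs, tarfile.is_tarfile(path) from the standard library can be used to check the decompressed output.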
