Read zip files from Amazon S3 using boto3 and Python


Problem Description

I have an S3 bucket with a large number of zip files, with sizes in the GBs. I need to calculate the data length of all the zip files. I went through boto3 but didn't figure it out. I'm not sure whether it can read a zip file directly, but the process I have in mind is:

  1. Connect to the bucket.
  2. Read the zip files from the bucket folder (say the folder is Mydata).
  3. Extract the zip files into another folder named Extracteddata.
  4. Read the Extracteddata folder and perform operations on the files.

Note: Nothing should be downloaded to local storage; the whole process should happen from S3 to S3. Any suggestions are appreciated.

Recommended Answer

What you want to do is impossible, as explained in John Rotenstein's answer. You have to download the zipfile: not necessarily to local storage, but at least into local memory, using up your local bandwidth. There's no way to run any code on S3.

However, there may be a way to get what you're really after here anyway.

If you could just download, say, 8KB worth of the file instead of the whole 5GB, would that be good enough? If so, and if you're willing to do a bit of work, then you're in luck. And what if you had to download, say, 1MB, but could get away with a lot less work?

If 1MB doesn't sound too bad, and you're willing to get a little hacky:

The only thing you want is a count of how many files are in the zipfile. For a zipfile, all of that information is available in the central directory, a very small chunk of data at the very end of the file.

And if you have the entire central directory, even if you're missing the rest of the file, the zipfile module in the stdlib will handle it just fine. It isn't documented to do so, but, at least in the versions included in recent CPython and PyPy 3.x, it definitely will.

So, what you can do is this:

  • Make a HEAD request to get just the headers. (In boto, you do this with head_object.)
  • Extract the file size from the Content-Length header.
  • Make a GET request with a Range header to download only from, say, size-1048576 to the end. (In boto, I believe you have to call get_object rather than one of the download* convenience methods, and you have to format the Range header value yourself; see the sketch after this list.)
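As a rough sketch of those three steps (the bucket and key names here are made-up placeholders, not anything from the question):

import boto3

BUCKET = "my-bucket"        # hypothetical bucket name
KEY = "Mydata/archive.zip"  # hypothetical key

s3 = boto3.client("s3")

# HEAD request: the response includes the object's size as ContentLength.
size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

# Ranged GET: fetch (at most) the last 1MB of the object.
start = max(0, size - 1048576)
resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={start}-{size - 1}")
buf = resp["Body"].read()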

Now, assuming you've got that last 1MB in a buffer buf:

import io
import zipfile

# ZipFile only needs the central directory at the end of the buffer to
# produce the archive's file listing.
z = zipfile.ZipFile(io.BytesIO(buf))
count = len(z.filelist)

Usually, 1MB is more than enough. But what about when it isn't? Well, here's where things get a little hacky. The zipfile module knows how many more bytes you need, but the only place it gives you that information is in the text of the exception description. So:

import io
import re
import zipfile

try:
    z = zipfile.ZipFile(io.BytesIO(buf))
except ValueError as e:
    # The message looks like "negative seek value -12345"; that number is
    # how many more bytes of the file the module needed.
    m = re.match(r'negative seek value -(\d+)', e.args[0])
    if not m:
        raise
    extra = int(m.group(1))
    # now go read from size-1048576-extra to size-1048576, prepend to buf, try again
count = len(z.filelist)
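Spelled out as a loop, that retry might look like this (a sketch only, reusing the hypothetical s3, BUCKET, KEY, size, and buf names from the earlier snippet):

while True:
    try:
        z = zipfile.ZipFile(io.BytesIO(buf))
        break
    except ValueError as e:
        m = re.match(r'negative seek value -(\d+)', e.args[0])
        if not m:
            raise
        extra = int(m.group(1))
        # Fetch the missing bytes immediately before what we already have,
        # then prepend them and try again.
        start = max(0, size - len(buf) - extra)
        resp = s3.get_object(Bucket=BUCKET, Key=KEY,
                             Range=f"bytes={start}-{size - len(buf) - 1}")
        buf = resp["Body"].read() + buf
count = len(z.filelist)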


If 1MB already sounds like too much bandwidth, or you don't want to rely on undocumented behavior of the zipfile module, you just need to do a bit more work.

In almost every case, you don't even need the whole central directory, just the total number of entries field within the end of central directory record, an even smaller chunk of data at the very end of the central directory.

So, do the same as above, but only read the last 8KB instead of the last 1MB.

And then, based on the zip format spec, write your own parser.

Of course you don't need to write a complete parser, or even close to it. You just need enough to deal with the fields from total number of entries to the end, all of which are fixed-size fields except for the zip64 extensible data sector and/or .ZIP file comment.
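A minimal sketch of that parsing, assuming buf again holds the tail of the file. The signature and field offsets come from the zip format spec; this ignores the zip64 case, where this 16-bit count is 0xFFFF and the real count lives in the zip64 end of central directory record:

import struct

EOCD_SIG = b"PK\x05\x06"  # end of central directory record signature

def count_entries(buf):
    # The EOCD record sits at the very end of the file, followed only by a
    # variable-length comment, so scan backwards for the last signature.
    pos = buf.rfind(EOCD_SIG)
    if pos == -1:
        raise ValueError("EOCD not found; read a larger tail of the file")
    # "total number of entries" is a little-endian uint16 at offset 10
    # from the start of the record.
    return struct.unpack_from("<H", buf, pos + 10)[0]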

Occasionally (e.g., for zipfiles with huge comments), you will need to read more data to get the count. This should be pretty rare, but if, for some reason, it turns out to be more common with your zipfiles, you can just change that 8192 guess to something larger.
