How to list files inside a tar in AWS S3 without downloading it?


Question

While looking around for ideas I found https://stackoverflow.com/a/54222447/264822, which handles zip files and which I think is a very clever solution. But it relies on zip files having a Central Directory, and tar files don't have one.

I thought I could follow the same general principle and expose the S3 file to tarfile through the fileobj parameter:

import boto3
import io
import tarfile

class S3File(io.BytesIO):
    def __init__(self, bucket_name, key_name, s3client):
        super().__init__()
        self.bucket_name = bucket_name
        self.key_name = key_name
        self.s3client = s3client
        self.offset = 0

    def close(self):
        return

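    def read(self, size=-1):
        print('read: offset = {}, size = {}'.format(self.offset, size))
        if size == 0:
            return b''
        if size is None or size < 0:
            # No size given: request everything from the current offset onward
            byte_range = "bytes=%d-" % self.offset
        else:
            byte_range = "bytes=%d-%d" % (self.offset, self.offset + size - 1)
        try:
            s3_object = self.s3client.get_object(Bucket=self.bucket_name, Key=self.key_name, Range=byte_range)
        except Exception:
            # Treat a failed request (e.g. a range past the end of the object) as EOF
            return b''
        result = s3_object['Body'].read()
        # Advance by the bytes actually returned, which can be fewer than requested
        self.offset += len(result)
        return result

    def seek(self, offset, whence=0):
        if whence == 0:
            print('seek: offset {} -> {}'.format(self.offset, offset))
            self.offset = offset
        # File-like seek() is expected to return the new absolute position
        return self.offset

    def tell(self):
        return self.offset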

s3file = S3File(bucket_name, file_name, s3client)
tarf = tarfile.open(fileobj=s3file)
names = tarf.getnames()
for name in names:
    print(name)

This works fine, except the output looks like:

read: offset = 0, size = 2
read: offset = 2, size = 8
read: offset = 10, size = 8192
read: offset = 8202, size = 1235
read: offset = 9437, size = 1563
read: offset = 11000, size = 3286
read: offset = 14286, size = 519
read: offset = 14805, size = 625
read: offset = 15430, size = 1128
read: offset = 16558, size = 519
read: offset = 17077, size = 573
read: offset = 17650, size = 620
(continued)

tarfile just reads the whole file anyway, so I haven't gained anything. Is there any way of making tarfile read only the parts of the file it needs? The only alternative I can think of is re-implementing the tar file parsing so that it:

  1. Reads the 512-byte header and writes it into a BytesIO buffer.
  2. Gets the size of the file that follows and writes zeroes into the BytesIO buffer.
  3. Skips over the file to the next header.

But this seems too complicated.
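For what it's worth, the header-walking idea in the three steps above can be sketched fairly compactly if the member names are collected directly rather than written into a BytesIO buffer. This is only a minimal sketch for an uncompressed tar with plain ustar-style headers: `list_tar_names` and `read_range` are hypothetical names, not part of tarfile or boto3, and GNU long-name or pax extended headers and base-256 size fields are not handled. Against S3, `read_range` could wrap the same ranged `get_object` call used in the code above; here an in-memory archive stands in for the object.

```python
import io
import tarfile

BLOCK = 512  # tar headers and member data are aligned to 512-byte blocks


def list_tar_names(read_range):
    """Yield (name, size) for each member of an uncompressed tar.

    read_range(start, length) is a hypothetical callable that returns up to
    `length` bytes beginning at byte `start` of the archive.
    """
    offset = 0
    while True:
        header = read_range(offset, BLOCK)
        # A short read or an all-zero block marks the end of the archive.
        if len(header) < BLOCK or header.count(b"\0") == BLOCK:
            return
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        # POSIX ustar headers may carry a path prefix at bytes 345-499.
        if header[257:263] == b"ustar\0":
            prefix = header[345:500].split(b"\0", 1)[0].decode("utf-8", "replace")
            if prefix:
                name = prefix + "/" + name
        # The size field is ASCII octal, padded with NULs or spaces.
        size = int(header[124:136].strip(b"\0 ") or b"0", 8)
        yield name, size
        # Skip this header plus the member data, rounded up to a full block.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK


# Demo: an in-memory tar stands in for the S3 object.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as out:
    for member_name, data in [("a.txt", b"hello"), ("dir/b.bin", b"x" * 1000)]:
        info = tarfile.TarInfo(member_name)
        info.size = len(data)
        out.addfile(info, io.BytesIO(data))
raw = buf.getvalue()

for name, size in list_tar_names(lambda start, length: raw[start:start + length]):
    print(name, size)
```

Each member costs one 512-byte ranged read, so an archive with many small files still means many requests, but the file contents themselves are never transferred.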

Recommended answer

My mistake. I'm actually dealing with tar.gz files, but I assumed that zip and tar.gz are similar. They're not: tar is an archive format whose output is then compressed with gzip, so to read the tar you have to decompress it first. My idea of pulling bits out of the tar file won't work.
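One cheap way to catch this kind of mix-up up front is to look at a few bytes of the object before choosing an approach. The sketch below is an illustration only: `sniff_archive` and `read_range` are hypothetical names, and it relies on two well-known markers, the gzip magic number `1f 8b` at offset 0 and the `ustar` magic at offset 257 of a tar header (very old tar files without that magic would come back as "unknown").

```python
import io
import tarfile


def sniff_archive(read_range):
    """Guess the container type from a few bytes.

    read_range(start, length) is a hypothetical helper returning the bytes
    in that range, for example via a ranged get_object call.
    """
    if read_range(0, 2) == b"\x1f\x8b":
        return "gzip"  # gzip magic number; a tar.gz starts with this
    if read_range(257, 5) == b"ustar":
        return "tar"  # POSIX and GNU tar headers carry "ustar" at offset 257
    return "unknown"


def make_archive(mode):
    """Build a tiny in-memory archive that stands in for the S3 object."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as out:
        info = tarfile.TarInfo("a.txt")
        info.size = 5
        out.addfile(info, io.BytesIO(b"hello"))
    return buf.getvalue()


for mode in ("w", "w:gz"):
    data = make_archive(mode)
    print(mode, sniff_archive(lambda start, length: data[start:start + length]))
```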

What does work is:

s3_object = s3client.get_object(Bucket=bucket_name, Key=file_name)
wholefile = s3_object['Body'].read()
fileobj = io.BytesIO(wholefile)
tarf = tarfile.open(fileobj=fileobj)
names = tarf.getnames()
for name in names:
    print(name)

I suspect the original code will work for an uncompressed tar file, but I don't have any to try it on.
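A possible variation on the working snippet, sketched here under assumptions rather than tested against S3: tarfile's stream mode `"r|gz"` reads forward only and never seeks, so it can consume any object that has a `read(size)` method. The response body returned by `get_object` exposes such a method, so passing `s3_object['Body']` as `fileobj` should avoid holding the whole archive in memory. Every byte is still transferred, since the gzip stream has to be decompressed from the start. `list_tar_gz_names` is a hypothetical helper name, and an in-memory archive stands in for the S3 body below.

```python
import io
import tarfile


def list_tar_gz_names(stream):
    """Return member names from a gzip-compressed tar read as a forward-only stream.

    `stream` is any object with a read(size) method.
    """
    # Mode "r|gz" makes tarfile treat the input as a non-seekable stream:
    # it decompresses block by block and does not call seek() on it.
    with tarfile.open(fileobj=stream, mode="r|gz") as tarf:
        return [member.name for member in tarf]


# Small in-memory archive standing in for the S3 object body.
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as out:
    for member_name, data in [("a.txt", b"hello"), ("dir/b.txt", b"world")]:
        info = tarfile.TarInfo(member_name)
        info.size = len(data)
        out.addfile(info, io.BytesIO(data))
archive.seek(0)

for name in list_tar_gz_names(archive):
    print(name)
```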
