How to stream a large gzipped .tsv file from s3, process it, and write back to a new file on s3?

Views: 104 | Published: 2021/4/3 19:33:33 | Tags: python csv amazon-s3 s3fs python-s3fs

Question

I have a large file, s3://my-bucket/in.tsv.gz, that I would like to load and process, and then write the processed version back to an s3 output file, s3://my-bucket/out.tsv.gz.

How do I stream in.tsv.gz directly from s3 without loading the whole file into memory (it does not fit in memory)?
How do I write the processed gzipped stream directly to s3?

In the following code, I show how I was planning to load the gzipped input into a dataframe from s3, and how I would write the .tsv if the output were located locally (bucket_dir_local = ./).

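import pandas as pd
import s3fs  # not called directly; pandas needs it installed to read s3:// paths
import os
import gzip
import csv
import io

bucket_dir = 's3://my-bucket/annotations/'
# Note: this reads the entire file into memory, which is the problem to solve
df = pd.read_csv(os.path.join(bucket_dir, 'in.tsv.gz'), sep='\t', compression="gzip")

bucket_dir_local = './'
# not sure how to do it with an s3 path
with gzip.open(os.path.join(bucket_dir_local, 'out.tsv.gz'), "w") as f:
    # newline="" lets the csv module control line endings itself
    with io.TextIOWrapper(f, encoding='utf-8', newline="") as wrapper:
        # delimiter="\t" so that the output really is tab-separated
        w = csv.DictWriter(wrapper, fieldnames=['test', 'testing'],
                           delimiter="\t", extrasaction="ignore")
        w.writeheader()
        for index, row in df.iterrows():
            # row.iloc[6] selects the seventh column by position
            my_dict = {"test": index, "testing": row.iloc[6]}
            w.writerow(my_dict)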

Edit: smart_open looks like a good fit for this.

Recommended answer

Here is a dummy example that reads a file from s3 and writes it back to s3 using smart_open:

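# smart_open's open() shadows the built-in open() in this script
from smart_open import open
import os

bucket_dir = "s3://my-bucket/annotations/"

# smart_open infers gzip (de)compression from the .gz extension,
# so fin yields decompressed lines and fout compresses what is written.
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(
        os.path.join(bucket_dir, "out.tsv.gz"), "wb"
    ) as fout:
        # Iterating over fin reads one line at a time, not the whole file
        for line in fin:
            # Placeholder processing: trim whitespace around every field
            fields = [field.strip() for field in line.decode().split("\t")]
            out_line = "\t".join(fields) + "\n"
            fout.write(out_line.encode())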
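
The loop above splits on tabs by hand. A minimal sketch of the same streaming idea with the csv module, producing the two columns from the question, might look like the following. The names transform_tsv and process_file are hypothetical helpers introduced for illustration, and gzip.open on local paths stands in for the s3 handles so that the sketch runs without AWS access. The assumption, which should be verified against your smart_open version, is that text-mode handles opened by smart_open on s3:// URLs can be passed to transform_tsv in the same way.

```python
import csv
import gzip


def transform_tsv(fin, fout, column=6):
    """Copy rows from text handle fin to text handle fout, one row at a time.

    Writes a 'test' column (row counter) and a 'testing' column
    (the value at position `column`, 6 as in the question).
    """
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.DictWriter(
        fout,
        fieldnames=["test", "testing"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    next(reader, None)  # skip the header row of the input file
    for index, row in enumerate(reader):
        writer.writerow({"test": index, "testing": row[column]})


def process_file(src, dst):
    """Hypothetical local driver: gzip.open stands in for the s3 handles."""
    # newline="" leaves line-ending handling to the csv module
    with gzip.open(src, "rt", encoding="utf-8", newline="") as fin, \
            gzip.open(dst, "wt", encoding="utf-8", newline="") as fout:
        transform_tsv(fin, fout)
```

Because transform_tsv only sees file-like objects, memory use stays at roughly one row regardless of file size, and the processing step can be tested on small local files before pointing it at the bucket.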
