Concatenate files on S3

Question

We receive many files in a single S3 folder (130K files, 2 GB combined). Each file contains JSON data, either one record or many. I need to merge these files into a single JSON file and store it on S3, without downloading the files to a local machine and combining them there. Is there a way to do this using the AWS SDK for Java?

Recommended answer

The simplest way to achieve this would be to use Amazon Athena to read and combine the files. Athena is a managed query service based on Presto that can read many different file formats.

The process would be:

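  • Create a table definition in Athena that defines the input file format and the location of the input data
    • (You can use an AWS Glue crawler to do this for you)
  • Run a query against that table that writes its results to S3 (for example, a CREATE TABLE AS SELECT statement)
    • This retrieves the data from the source files and writes the output to a new location
    • You can specify the output format and location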
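The steps above might be expressed as two SQL statements. The sketch below only assembles them as Java strings; it does not call AWS. CREATE TABLE AS SELECT is one way Athena can write query results to a new S3 location in a chosen format. The table names (`raw_events`, `merged_events`), the column list, and the `example-bucket` paths are hypothetical and would need to match the real data.

```java
import java.util.List;

/** Builds the two Athena statements for the merge (illustrative names only). */
final class AthenaMergeSql {

    /** DDL for an external table over the folder that holds the JSON input files. */
    static String createInputTable(String table, List<String> columns, String inputLocation) {
        // Assumption: each record is a single-line JSON object, which this SerDe reads.
        return "CREATE EXTERNAL TABLE " + table + " (\n  "
                + String.join(",\n  ", columns) + "\n)\n"
                + "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
                + "LOCATION '" + inputLocation + "'";
    }

    /** CTAS statement that reads every input file and writes JSON to a new location. */
    static String createMergedTable(String outputTable, String inputTable, String outputLocation) {
        return "CREATE TABLE " + outputTable + "\n"
                + "WITH (format = 'JSON', external_location = '" + outputLocation + "')\n"
                + "AS SELECT * FROM " + inputTable;
    }

    public static void main(String[] args) {
        // Hypothetical schema and bucket paths, for illustration only.
        System.out.println(createInputTable(
                "raw_events",
                List.of("id string", "payload string"),
                "s3://example-bucket/input/"));
        System.out.println();
        System.out.println(createMergedTable(
                "merged_events", "raw_events", "s3://example-bucket/merged/"));
    }
}
```

Replacing `SELECT *` with a column list and a `WHERE` clause would filter the data while it is copied. Note that Athena may write the result as more than one object under the output location.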

Think of Athena as a "query layer" on top of Amazon S3. It reads the input from all files in a given S3 directory and can then write the results back to S3. You can do a simple SELECT * to copy all the data, or you can shape the results by selecting only the desired fields and entries (using SELECT and WHERE).

Athena can be run from the management console, or triggered via a normal AWS SDK (such as the one for Java).
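Athena queries run asynchronously: a caller starts a query, receives an execution ID, and then polls until the state is SUCCEEDED, FAILED, or CANCELLED. The sketch below shows that start-and-poll flow without depending on the SDK. `QueryApi` is a hypothetical interface introduced here for illustration; in the AWS SDK for Java 2.x its two methods would roughly correspond to `AthenaClient.startQueryExecution` and `AthenaClient.getQueryExecution`.

```java
import java.util.Iterator;
import java.util.List;

/** Start-and-poll flow for an Athena query, written against a small seam. */
final class AthenaQueryFlow {

    /** Hypothetical wrapper; a real implementation would delegate to an Athena client. */
    interface QueryApi {
        String start(String sql, String database, String resultLocation); // returns execution ID
        String state(String queryExecutionId); // QUEUED, RUNNING, SUCCEEDED, FAILED or CANCELLED
    }

    /** Starts the query and blocks until it reaches a terminal state. */
    static String runAndWait(QueryApi api, String sql, String database,
                             String resultLocation, long pollMillis) {
        String id = api.start(sql, database, resultLocation);
        while (true) {
            String state = api.state(id);
            switch (state) {
                case "SUCCEEDED":
                    return id;
                case "FAILED":
                case "CANCELLED":
                    throw new IllegalStateException("Query " + id + " ended in state " + state);
                default: // QUEUED or RUNNING: wait and poll again
                    try {
                        Thread.sleep(pollMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while waiting for " + id, e);
                    }
            }
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for Athena so the flow can be run without AWS access.
        Iterator<String> states = List.of("QUEUED", "RUNNING", "SUCCEEDED").iterator();
        QueryApi fake = new QueryApi() {
            public String start(String sql, String database, String resultLocation) {
                return "example-execution-id";
            }
            public String state(String queryExecutionId) {
                return states.next();
            }
        };
        String id = runAndWait(fake, "SELECT 1", "default", "s3://example-bucket/results/", 10L);
        System.out.println("Finished query " + id);
    }
}
```

Keeping the SDK behind a small interface like this is a design choice for testability; calling the client directly inside the loop would work equally well.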

The benefit of using Athena is that there is no need to download the source files and upload the result: Athena does all of this.

Athena charges based on the amount of data each query reads from storage, so compressed source files reduce the cost.
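As a rough worked example of that pricing model, the sketch below estimates the cost of scanning the 2 GB of input from the question, uncompressed and at a hypothetical 4:1 compression ratio. The rate of 5 USD per TB scanned is an assumption used for illustration; the actual rate depends on the region and current AWS pricing.

```java
/** Back-of-the-envelope Athena scan cost (rate is an assumed input, not a quoted price). */
final class AthenaCostEstimate {

    private static final double BYTES_PER_TB = 1024.0 * 1024 * 1024 * 1024;

    /** Cost in USD for scanning the given number of bytes at the given per-TB rate. */
    static double estimateUsd(long bytesScanned, double usdPerTb) {
        return (bytesScanned / BYTES_PER_TB) * usdPerTb;
    }

    public static void main(String[] args) {
        double assumedUsdPerTb = 5.0;              // assumption: check current pricing
        long twoGb = 2L * 1024 * 1024 * 1024;      // combined input size from the question
        long compressed = twoGb / 4;               // hypothetical 4:1 compression ratio

        System.out.printf("Uncompressed scan: %.6f USD%n", estimateUsd(twoGb, assumedUsdPerTb));
        System.out.printf("Compressed scan:   %.6f USD%n", estimateUsd(compressed, assumedUsdPerTb));
    }
}
```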
