How to move Amazon S3 objects into partitioned directories


Problem description

Take for example an S3 bucket with the following structure, with files of the form francescototti_yyyy_mm_dd_hh.csv.gz, for example:

francescototti_2019_05_01_00.csv.gz,
francescototti_2019_05_01_01.csv.gz,
francescototti_2019_05_01_02.csv.gz,
.....
francescototti_2019_05_01_23.csv.gz,
francescototti_2019_05_02_00.csv.gz

Each hourly file is about 30 MB. I would like the final Hive table to be partitioned by day and stored as ORC files.

What is the best way to do this? I can imagine a few approaches, potentially one of the following.

  1. An automated script that takes the day's hourly files and moves them into a corresponding day folder in the S3 bucket, then create a partitioned external table over this newly structured S3 bucket.

  2. Have an external Hive table on top of the raw S3 location, and create an additional partitioned Hive table that gets inserted into from the raw table.
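As a sketch of the move step in the first approach, the daily partition key can be derived from each file name; the actual copy/delete would go through boto3 (the bucket name and the `day=` layout below are assumptions for illustration, not part of the question):

```python
import re

# Map an hourly file name like "francescototti_2019_05_01_00.csv.gz"
# to a Hive-style daily partition key, e.g.
# "day=2019-05-01/francescototti_2019_05_01_00.csv.gz".
NAME_RE = re.compile(
    r"^(?P<prefix>\w+?)_(?P<y>\d{4})_(?P<m>\d{2})_(?P<d>\d{2})_(?P<h>\d{2})\.csv\.gz$"
)

def daily_key(filename: str) -> str:
    """Return the destination key under a day= prefix for one hourly file."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename}")
    return f"day={m['y']}-{m['m']}-{m['d']}/{filename}"

def move_to_daily_folder(bucket: str, filename: str) -> None:
    """Copy the object under its day= prefix, then delete the original.
    Hypothetical helper: requires boto3 and AWS credentials to actually run."""
    import boto3  # imported here so the pure helper above stays stdlib-only
    s3 = boto3.client("s3")
    s3.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": filename},
        Key=daily_key(filename),
    )
    s3.delete_object(Bucket=bucket, Key=filename)
```

The external table would then declare `day` as its partition column over the new layout, so each day's 24 files land in one partition directory.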

What are the pros/cons of each? Any other recommendations?

Answer

The first option (an automated script that moves the hourly files into corresponding day folders in the S3 bucket, with a partitioned external table created over the newly structured bucket) looks better than building a table on top of the raw S3 location, because the raw location contains too many files: queries will run slowly, since Hive lists all of the files even if you filter by the INPUT__FILE__NAME virtual column, and it gets even worse as fresh files keep arriving.

If there are not too many files, say hundreds in the raw folder, and the folder is not growing, then I would choose option 2.

A possible drawback of option one is the eventual-consistency issue after removing files and then repeatedly reading/listing the folder. After removing a large number of files (say thousands at a time) you will definitely hit eventual-consistency issues (phantom files) during the next hour or so. But it seems you are not going to remove that many at a time: you will move 24 files at a time, so with very high probability you will not hit eventual-consistency problems in S3. Another drawback is that moving files has a cost. Even so, it is better than reading/listing too many files in the same folder.

So, option one looks better.

Other recommendations: rewrite the upstream process so that it writes files into daily folders. This is the best option. In that case you can build the table on top of the top-level S3 location and simply add a daily partition each day. Partition pruning will work fine, you do not need to move files, and there are no S3 consistency issues.
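To sketch the "add a daily partition each day" step, a scheduled job could issue one HiveQL statement per day; the helper below just builds that statement (the table name, bucket name, and `day=` layout are hypothetical placeholders):

```python
import datetime

def add_partition_ddl(table: str, bucket: str, day: datetime.date) -> str:
    """Build the HiveQL statement that registers one daily partition.
    Table and bucket names are placeholders, not from the question."""
    location = f"s3://{bucket}/day={day.isoformat()}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (day='{day.isoformat()}') "
        f"LOCATION '{location}'"
    )

# e.g. run once a day from a scheduler, then submit stmt to Hive:
stmt = add_partition_ddl("francescototti_logs", "my-bucket",
                         datetime.date(2019, 5, 1))
```

`ADD IF NOT EXISTS` makes the job safe to re-run; alternatively, `MSCK REPAIR TABLE` can discover new `day=` folders in bulk.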
