Redshift Spectrum: Automatically partition tables by date/folder


Question

We currently generate a daily CSV export that we upload to an S3 bucket, into the following structure:

<report-name>
|--reportDate-<date-stamp>
    |-- part0.csv.gz
    |-- part1.csv.gz

We want to be able to run reports partitioned by daily export.

According to this page, you can partition data in Redshift Spectrum by a key which is based on the source S3 folder where your Spectrum table sources its data. However, from the example, it looks like you need an ALTER statement for each partition:

alter table spectrum.sales_part
add partition(saledate='2008-01-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2008-01/';

alter table spectrum.sales_part
add partition(saledate='2008-02-01') 
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';

Is there any way to set the table up so that data is automatically partitioned by the folder it comes from, or do we need a daily job to ALTER the table to add that day's partition?

Answer

Solution 1:

At most 20,000 partitions can be created per table, so you can run a one-time script that adds partitions (up to that limit) for all future S3 partition folders.

For example, even if the folder s3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/ doesn't exist yet, you can still add a partition for it:

alter table spectrum.sales_part
add partition(saledate='2017-12-01') 
location 's3://bucket/tickit/spectrum/sales_partition/saledate=2017-12/';
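That one-time script could simply generate the ALTER statements up front. Here is a minimal sketch; the table name, bucket, and monthly saledate=YYYY-MM folder scheme are taken from the example above, while the helper name and the year range are assumptions:

```python
from datetime import date

def generate_partition_ddl(start_year, end_year,
                           table="spectrum.sales_part",
                           bucket="bucket",
                           prefix="tickit/spectrum/sales_partition"):
    """Generate one ALTER TABLE ... ADD PARTITION statement per month
    for every year in [start_year, end_year]."""
    statements = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            saledate = date(year, month, 1).isoformat()  # e.g. 2017-12-01
            folder = f"{year}-{month:02d}"                # e.g. 2017-12
            statements.append(
                f"alter table {table} "
                f"add partition(saledate='{saledate}') "
                f"location 's3://{bucket}/{prefix}/saledate={folder}/';"
            )
    return statements

# Ten years of monthly partitions is 120 statements, well under the
# 20,000-partition limit; run the output once against Redshift.
ddl = generate_partition_ddl(2017, 2026)
```

A decade of pre-created monthly partitions costs nothing while the folders are empty, and Spectrum simply returns no rows for partitions whose location has no data yet.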

Solution 2:

This AWS Big Data blog post describes automating the same step: when new data lands in S3, an event-triggered Lambda function derives the partition from the object key and adds it to the external table:

https://aws.amazon.com/blogs/big-data/data-lake-ingestion-automatically-partition-hive-external-tables-with-aws/
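The core of that event-driven approach is mapping an S3 object key to the ALTER statement for its partition folder. A hedged sketch of that key-parsing step, assuming the folder layout from the question (the function name and the hardcoded bucket are illustrative, not the blog's exact code):

```python
import re

def alter_statement_for_key(key, table="spectrum.sales_part",
                            bucket="bucket"):
    """Derive an ADD PARTITION statement from an S3 object key, or
    return None if the key has no saledate=YYYY-MM partition folder."""
    m = re.search(r"^(.*?/saledate=(\d{4}-\d{2}))/", key)
    if not m:
        return None
    folder, year_month = m.groups()
    return (
        f"alter table {table} "
        f"add partition(saledate='{year_month}-01') "
        f"location 's3://{bucket}/{folder}/';"
    )

# A Lambda handler would call this for each record in the S3 event
# notification, then execute the statement against Redshift.
stmt = alter_statement_for_key(
    "tickit/spectrum/sales_partition/saledate=2017-12/part0.csv.gz")
```

Because the Lambda may fire more than once for the same folder (one event per uploaded part file), the executed statement should tolerate duplicates, for example by catching the already-exists error.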

