Scheduling data extraction from AWS Redshift to S3


Question

I am trying to build a job that extracts data from Redshift and writes the same data to S3 buckets. So far I have explored AWS Glue, but Glue cannot run custom SQL against Redshift. I know we can run UNLOAD commands and store the output directly in S3. I am looking for a solution that can be parameterised and scheduled in AWS.

Answer

Consider using AWS Data Pipeline for this.

AWS Data Pipeline is an AWS service that allows you to define and schedule regular jobs. These jobs are referred to as pipelines. A pipeline contains the business logic of the work required, for example extracting data from Redshift to S3. You can schedule a pipeline to run as often as you require, e.g. daily.

A pipeline is defined by you, and you can even put it under version control. You can prepare a pipeline definition in the browser using the Data Pipeline Architect, or compose it in a JSON file locally on your computer. A pipeline definition is composed of components, such as a Redshift database, an S3 node and a SQL activity, as well as parameters, for example to specify the S3 path to use for the extracted data.
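As an illustration, such a parameter is declared in the parameters section of the definition file and referenced from components with the #{...} expression syntax. The parameter name myOutputS3Path and the bucket below are made-up placeholders; this is only a minimal sketch of what that fragment might look like:

    {
      "parameters": [
        {
          "id": "myOutputS3Path",
          "type": "AWS::S3::ObjectStorageLocation",
          "description": "S3 directory that receives the extracted data"
        }
      ],
      "values": {
        "myOutputS3Path": "s3://my-example-bucket/redshift-exports"
      }
    }

You can supply a different value for the parameter when you activate the pipeline, which is what makes the extraction parameterisable.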

The AWS Data Pipeline service handles the scheduling, the dependencies between the components in your pipeline, monitoring and error handling.

For your specific use case, I would consider the following options:

Option 1

Define a pipeline with the following components: a SQLDataNode and an S3DataNode. The SQLDataNode references your Redshift database and the SELECT query to use to extract your data. The S3DataNode points to the S3 path to be used to store your data. You add a CopyActivity to copy data from the SQLDataNode to the S3DataNode. When such a pipeline runs, it retrieves data from Redshift using the SQLDataNode and copies that data to the S3DataNode using the CopyActivity. The S3 path in the S3DataNode can be parameterised so that it is different every time the pipeline runs.
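A minimal sketch of such a definition follows. In the definition file the object types are spelled SqlDataNode, S3DataNode and CopyActivity (see the documentation links at the end); the object ids, cluster id, table, query and credentials below are made-up placeholders, so treat this as an outline rather than a complete, validated pipeline:

    {
      "objects": [
        {
          "id": "RedshiftDb",
          "type": "RedshiftDatabase",
          "clusterId": "my-redshift-cluster",
          "databaseName": "analytics",
          "username": "etl_user",
          "*password": "example-password"
        },
        {
          "id": "SourceData",
          "type": "SqlDataNode",
          "database": { "ref": "RedshiftDb" },
          "table": "sales",
          "selectQuery": "SELECT * FROM sales WHERE sale_date = CURRENT_DATE - 1"
        },
        {
          "id": "OutputData",
          "type": "S3DataNode",
          "directoryPath": "#{myOutputS3Path}/#{format(@scheduledStartTime, 'YYYY-MM-dd')}"
        },
        {
          "id": "DailySchedule",
          "type": "Schedule",
          "period": "1 day",
          "startAt": "FIRST_ACTIVATION_DATE_TIME"
        },
        {
          "id": "CopyToS3",
          "type": "CopyActivity",
          "input": { "ref": "SourceData" },
          "output": { "ref": "OutputData" },
          "runsOn": { "ref": "Ec2Instance" },
          "schedule": { "ref": "DailySchedule" }
        }
      ]
    }

The #{myOutputS3Path} reference is what makes the S3 path a parameter, and the #{format(@scheduledStartTime, 'YYYY-MM-dd')} expression gives each scheduled run its own directory. The Ec2Instance that runsOn points to is the Ec2Resource sketched in the last section of this answer.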

Option 2

Firstly, define the SQL query with the UNLOAD statement to be used to unload your data to S3. Optionally, you can save it in a file and upload that file to S3. Use a SQLActivity component to specify the SQL query to execute against the Redshift database. The SQL query in the SQLActivity can be either a reference to the S3 path where you stored the query, or the query text itself. Whenever the pipeline runs, it connects to Redshift and executes the SQL query, which stores the data in S3. Constraint of option 2: in the UNLOAD statement the S3 path is static. If you plan to store every data extract in a separate S3 path, you will have to modify the UNLOAD statement to use another S3 path every time you run it, which is not an out-of-the-box feature.
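A sketch of such an activity is below. In the definition file the object type is spelled SqlActivity; the UNLOAD text is embedded in the script field (scriptUri pointing to a .sql file in S3 is the alternative), and the table, bucket and IAM role ARN are made-up placeholders. It reuses the RedshiftDatabase, Schedule and EC2 resource objects from the other sketches in this answer:

    {
      "id": "UnloadToS3",
      "type": "SqlActivity",
      "database": { "ref": "RedshiftDb" },
      "script": "UNLOAD ('SELECT * FROM sales') TO 's3://my-example-bucket/redshift-exports/sales_' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole' DELIMITER ',' GZIP ALLOWOVERWRITE;",
      "runsOn": { "ref": "Ec2Instance" },
      "schedule": { "ref": "DailySchedule" }
    }

Note that the TO path of the UNLOAD is fixed text inside the script, which is exactly the static-path constraint described above.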

Where do these pipelines run?

On an EC2 instance with a Task Runner, the tool provided by AWS for running data pipelines. You can start that instance automatically at the time the pipeline runs, or you can reference an already running instance that has a Task Runner installed on it. You have to make sure that the EC2 instance is allowed to connect to your Redshift database.
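The first variant corresponds to an Ec2Resource object in the same definition; this is the Ec2Instance the activities above reference via runsOn. The instance type, subnet and security group ids below are placeholders, and the security group is where you would allow access to the Redshift cluster. (For the second variant, the activities would instead carry a workerGroup field naming the worker group your long-running Task Runner was started with.)

    {
      "id": "Ec2Instance",
      "type": "Ec2Resource",
      "instanceType": "t2.micro",
      "subnetId": "subnet-0123456789abcdef0",
      "securityGroupIds": ["sg-0123456789abcdef0"],
      "terminateAfter": "1 Hour"
    }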

Related documentation:

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftdatabase.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqldatanode.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-sqlactivity.html

http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
