RDS to S3 - Data Transformation AWS

Question

I have about 30 tables in my RDS instance (haven't decided yet whether it will be Oracle or Postgres). I want to fetch all the records that have been inserted or updated in the last 4 hours (configurable), create a CSV file for each table, and store the files in S3. I want the whole process to be transactional: if there is any error fetching data from one table, I don't want the data for the other 29 tables to be persisted in S3. The data isn't very large; over a 4-hour window it should be on the order of a few hundred records or fewer per table.
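A minimal PySpark sketch of the per-table extract, assuming Postgres and that every table carries a last-modified timestamp column (called `updated_at` here); the JDBC URL, credentials, bucket, and table list are all placeholders:

```python
from datetime import datetime, timedelta

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rds-to-s3-export").getOrCreate()

TABLES = ["orders", "customers"]   # ...plus the other ~28 tables
WINDOW_HOURS = 4                   # the configurable lookback window
cutoff = datetime.utcnow() - timedelta(hours=WINDOW_HOURS)

for table in TABLES:
    # Push the time filter into the JDBC subquery so only recent rows travel.
    query = (f"(SELECT * FROM {table} "
             f"WHERE updated_at >= TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}') AS t")
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://my-rds-host:5432/mydb")  # placeholder
          .option("dbtable", query)
          .option("user", "etl_user")          # placeholder credentials
          .option("password", "change-me")
          .load())
    # One CSV per table, written to a staging prefix first so a later step
    # can promote everything at once (see the atomicity discussion below).
    (df.coalesce(1)
       .write.mode("overwrite")
       .csv(f"s3://my-bucket/staging/{table}/", header=True))
```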

I am thinking of having a Spark job on an EMR cluster fetch the data from RDS, create a CSV for each table, and post all the files to S3 at the end of the process. The EMR cluster would be destroyed once the data is posted to S3. A CloudWatch trigger would invoke a Lambda every 4 hours, which spins up a new EMR cluster to perform this job.
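For the scheduling part, here is a hedged sketch of the Lambda handler such a CloudWatch Events rule could fire; the release label, instance types, roles, and script path are illustrative placeholders, not prescriptions:

```python
import boto3

emr = boto3.client("emr")

def handler(event, context):
    """Spin up a transient EMR cluster that runs the Spark export and dies."""
    response = emr.run_job_flow(
        Name="rds-to-s3-export",
        ReleaseLabel="emr-6.15.0",                 # illustrative release
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
            ],
            # With no keep-alive, the cluster terminates after the last step.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[{
            "Name": "export",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit",
                         "s3://my-bucket/jobs/rds_to_s3.py"],  # placeholder
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]
```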

Are there any alternate approaches worth exploring for this transformation?

Answer

Take a look at AWS Glue, which uses EMR under the hood but spares you from managing the infrastructure and configuration: just set up a crawler and write your ETL job.

Please note that AWS Glue doesn't support predicate pushdown for JDBC connections (currently S3 only), which means it will load the entire table first and only then apply the filtering.
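To make that concrete, here is a sketch of a Glue (PySpark) job; the catalog database/table names and the `updated_at` column are assumptions. The full table is materialized by `create_dynamic_frame.from_catalog`, and the time filter only runs afterwards:

```python
from datetime import datetime, timedelta

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
cutoff = datetime.utcnow() - timedelta(hours=4)

# This loads the WHOLE table over JDBC -- no predicate pushdown happens here.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_rds_db", table_name="orders")     # placeholder catalog names

# The time filter is only applied after the full load has completed.
recent = Filter.apply(frame=frame,
                      f=lambda row: row["updated_at"] >= cutoff)

glue_context.write_dynamic_frame.from_options(
    frame=recent,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/staging/orders/"},
    format="csv")
```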

You should also think carefully about atomicity, since a Glue ETL job simply processes data and writes it to a sink without transactions. In case of failure it won't remove partially written records, so you have to manage that yourself. There are a few options I would consider:

  1. On each run, write the data to a temp folder first (local or S3), and then copy it to the final S3 location only once all tables have been exported successfully (a sketch of this pattern follows below).
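A minimal boto3 sketch of that staging-then-promote pattern, with bucket and prefix names as placeholders:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"   # placeholder

def promote(run_id, staging_prefix="staging/", final_prefix="exports/"):
    """Copy every object of one run from staging to final, then clean up."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET,
                                   Prefix=f"{staging_prefix}{run_id}/"):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            dest = key.replace(staging_prefix, final_prefix, 1)
            s3.copy_object(Bucket=BUCKET, Key=dest,
                           CopySource={"Bucket": BUCKET, "Key": key})
            s3.delete_object(Bucket=BUCKET, Key=key)
```

If any of the 30 table exports fails, the job simply skips `promote()` and deletes the staging prefix, so consumers of the final prefix never observe a partial run.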