How to load data into Redshift from a custom REST API


Problem description

I am new to AWS, so please forgive me if this question has been asked previously.

I have a REST API that returns two fields (name, email), and I want to load this data into Redshift.

My plan is a Lambda function that runs every 2 minutes and calls the REST API. The API might return at most 3-4 records within each 2-minute window.
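For context, a minimal sketch of such a polling Lambda might look like the following, assuming the API returns a JSON array of objects with name and email fields; the URL and response shape here are hypothetical placeholders:

```python
import json
import urllib.request

API_URL = "https://example.com/api/records"  # hypothetical endpoint

def lambda_handler(event, context):
    # Invoked on a 2-minute EventBridge (CloudWatch Events) schedule;
    # fetches whatever records the API has accumulated since the last run.
    with urllib.request.urlopen(API_URL, timeout=10) as resp:
        records = json.loads(resp.read())
    # Expect at most 3-4 {"name": ..., "email": ...} objects per run.
    return {"count": len(records)}
```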

So, in this situation, is it okay to just do an INSERT operation, or should I still use COPY (via S3)? My only concerns are performance and error-free (robust) data insertion.

Also, the Lambda function will start asynchronously every 2 minutes, so insert operations might overlap (although the data won't overlap).

In this situation, if I go with the S3 option, I am worried that the S3 file generated by a previous Lambda invocation will be overwritten and a conflict will occur.
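If the S3 route is chosen anyway, the overwrite worry can be avoided by giving every invocation its own object key. A sketch, with hypothetical bucket and prefix names:

```python
import csv
import io
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def write_batch_to_s3(records, bucket="my-staging-bucket"):
    # A timestamp plus a random UUID makes the key unique per invocation,
    # so concurrent Lambdas can never overwrite each other's files.
    key = "redshift-staging/{}_{}.csv".format(
        datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S"),
        uuid.uuid4().hex,
    )
    buf = io.StringIO()
    csv.writer(buf).writerows((r["name"], r["email"]) for r in records)
    s3.put_object(Bucket=bucket, Key=key, Body=buf.getvalue().encode("utf-8"))
    return key
```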

Long story short, what is the best practice for inserting a small number of records into Redshift?

PS: I am okay with using other AWS components as well. I even looked into Firehose, which would be perfect for me, but it can't load data into a Redshift cluster in a private subnet.

Thanks in advance.

Recommended answer

Yes, it would be fine to INSERT small amounts of data.
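A minimal sketch of such an INSERT from a Lambda, assuming the psycopg2 driver is bundled with the function and that a hypothetical users(name, email) table exists; the connection parameters are placeholders and would normally come from environment variables or Secrets Manager:

```python
import psycopg2
from psycopg2.extras import execute_values

def insert_records(records):
    conn = psycopg2.connect(
        host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439,
        dbname="mydb",
        user="myuser",
        password="mypassword",
    )
    try:
        with conn.cursor() as cur:
            # One multi-row INSERT per invocation keeps round trips minimal.
            execute_values(
                cur,
                "INSERT INTO users (name, email) VALUES %s",
                [(r["name"], r["email"]) for r in records],
            )
        conn.commit()
    finally:
        conn.close()
```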

The recommendation to always load via a COPY command applies to large amounts of data, because COPY loads are parallelized across multiple nodes. For just a few rows, however, you can use INSERT without feeling guilty.
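For comparison, a bulk load via COPY issued through the same connection might look roughly like this; the bucket, object key, and IAM role ARN are placeholders:

```python
COPY_SQL = """
COPY users (name, email)
FROM 's3://my-staging-bucket/redshift-staging/batch_0001.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
FORMAT AS CSV;
"""

def copy_batch(conn):
    # COPY pulls the file from S3 and distributes the load work
    # across the cluster's slices in parallel.
    with conn.cursor() as cur:
        cur.execute(COPY_SQL)
    conn.commit()
```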

If your SORTKEY is a timestamp and you are loading data in time order, there is also less need to perform a VACUUM, since the data is already sorted. However, if rows are being deleted, it is still good practice to VACUUM the table regularly.
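A small maintenance sketch, assuming the same hypothetical users table; note that Redshift's VACUUM cannot run inside a transaction block:

```python
def vacuum_table(conn, table="users"):
    # VACUUM must run outside a transaction, so enable autocommit first.
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("VACUUM {};".format(table))
```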
