Loading data (incrementally) into Amazon Redshift: S3 vs DynamoDB vs Insert


Problem description


I have a web app that needs to send reports on its usage. I want to use Amazon Redshift as a data warehouse for that purpose. How should I collect the data?

Every time a user interacts with my app, I want to report it. So when should I write the files to S3, and how many? What I mean is:

  • If I don't send the info immediately, I might lose it as a result of a lost connection, or because of some bug in my system while it is being collected and prepared for sending to S3.

  • If I do write a file to S3 on each user interaction, I will end up with hundreds of files (each holding minimal data) that need to be managed, sorted, and deleted after being copied to Redshift. That does not seem like a good solution.

What am I missing? Should I use DynamoDB instead? Should I use simple INSERTs into Redshift instead?
If I do need to write the data to DynamoDB, should I delete the holding table after it has been copied? What are the best practices?

In any case, what are the best practices to avoid data duplication in Redshift?

Appreciate the help!

Solution

It is preferred to aggregate event logs before ingesting them into Amazon Redshift.

The benefits are:

  • You will make better use of Redshift's parallel nature; a COPY of a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT statements or a COPY of each small file (see the sketch after this list).

  • You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This also improves your load performance and reduces the need to VACUUM your tables.
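As a minimal sketch of such a bulk load, assuming an events table and a hypothetical bucket layout where aggregated, gzip-compressed log files are written under a per-day prefix:

```sql
-- Load every file under the prefix in one COPY; Redshift splits the
-- work across the cluster's slices, which is where the speed-up over
-- per-event INSERTs comes from. Bucket, prefix and credentials are
-- placeholders.
COPY events
FROM 's3://my-log-bucket/aggregated/2014-01-01/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
DELIMITER '\t';
```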

You can accumulate your events in several places before aggregating and loading them into Redshift:

  • Local files to S3 - the most common way is to aggregate your logs on the client/server and upload them to S3 every x MB or y minutes. There are many log appenders that support this functionality, and you don't need to make any modifications to your code (for example, FluentD or Log4J); it can be done through container configuration only. The downside is that you risk losing some logs if the local log files are deleted before the upload.

  • DynamoDB - as @Swami described, DynamoDB is a very good way to accumulate the events.

  • Amazon Kinesis - this recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are kept in order of insertion, which makes it easy to load them into Redshift pre-sorted later on. Events are stored in Kinesis for 24 hours, and you can schedule, for example, an hourly read from Kinesis and load into Redshift for better performance.

Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push events directly from the end users/devices, without the need to go through an intermediate web server. This can significantly improve the availability of your service (these services absorb increased load and server failures for you) and the cost of the system (you only pay for what you use, and you don't need underutilized servers just for logs).

See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490

Another important set of tools that allows direct interaction with these services is the collection of AWS SDKs, for example for Java, .NET, JavaScript, iOS and Android.

Regarding the de-duplication requirement: in most of the options above you can handle it in the aggregation phase. For example, when you read from a Kinesis stream, you can check that you don't have duplications in your events by analysing a large buffer of events before putting them into the data store.

However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging table and then SELECT INTO a well-organized and sorted table, as sketched below.
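A minimal sketch of that staging pattern, using an anti-join to skip rows that were already loaded; the event_id column and the S3 location are hypothetical placeholders:

```sql
-- 1. Load the new batch into a temporary staging table with the same
--    structure as the target table.
CREATE TEMP TABLE events_staging (LIKE events);

COPY events_staging
FROM 's3://my-log-bucket/aggregated/2014-01-02/'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
DELIMITER '\t';

-- 2. Insert only the rows whose event_id is not already in the target,
--    de-duplicating against existing data and within the batch itself.
INSERT INTO events
SELECT DISTINCT s.*
FROM events_staging s
LEFT JOIN events e ON s.event_id = e.event_id
WHERE e.event_id IS NULL;

DROP TABLE events_staging;
```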

Another best practice you can implement is a daily (or weekly) table partition. Even if you would like to have one big events table, if the majority of your queries run on a single day (the last day, for example), you can create a set of tables with a similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... into each of these tables. When you want to query the data from multiple days, you can use UNION ALL, as in the sketch below.
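For illustration, assuming daily tables that follow the naming pattern above, sharing the structure of a hypothetical events template table with a user_id column:

```sql
-- One table per day, all with the same structure.
CREATE TABLE events_01012014 (LIKE events);
CREATE TABLE events_01022014 (LIKE events);

-- A query spanning several days unions the daily tables together.
SELECT user_id, COUNT(*) AS interactions
FROM (
    SELECT * FROM events_01012014
    UNION ALL
    SELECT * FROM events_01022014
) AS two_days
GROUP BY user_id;
```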
