Amazon Redshift: bulk insert vs COPYing from S3


Question

I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a clicks table. Let's say I have ~10 new 'clicks' that I want to store each second. If possible, I would like my data to be available in Redshift as soon as possible.

From what I understand, insert performance is poor because of the columnar storage, so you have to insert in batches. My workflow is to store the clicks in Redis and, every minute, insert the ~600 accumulated clicks from Redis into Redshift as a batch.

I have two ways of inserting a batch of clicks into Redshift:

  • Multi-row insert strategy: I use a regular insert query to insert multiple rows (sketched below). Multi-row insert documentation here
  • S3 Copy strategy: I upload the rows to S3 as clicks_1408736038.csv, then run a COPY to load the file into the clicks table (sketched below). COPY documentation here
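
For concreteness, hedged sketches of the two statements; the clicks column names, bucket name, and credentials below are placeholders, not taken from the original post:

    -- Multi-row insert strategy: one INSERT carrying the whole batch
    INSERT INTO clicks (click_time, user_id, url)
    VALUES
        ('2014-08-22 19:33:58', 1234, '/home'),
        ('2014-08-22 19:33:59', 5678, '/pricing');

    -- S3 Copy strategy: load the uploaded CSV file in one command
    COPY clicks
    FROM 's3://my-clicks-bucket/clicks_1408736038.csv'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    CSV;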

I've done some tests (on a clicks table that already had 2 million rows):

             | multi-row insert strategy |       S3 Copy strategy    |
             |---------------------------+---------------------------+
             |       insert query        | upload to s3 | COPY query |
-------------+---------------------------+--------------+------------+
1 record     |           0.25s           |     0.20s    |   0.50s    |
1k records   |           0.30s           |     0.20s    |   0.50s    |
10k records  |           1.90s           |     1.29s    |   0.70s    |
100k records |           9.10s           |     7.70s    |   1.50s    |

As you can see, in terms of raw performance it looks like I gain nothing by first copying the data to S3: the upload + COPY time is roughly equal to the insert time.

Questions:

What are the advantages and drawbacks of each approach? What is the best practice? Am I missing anything?

And a side question: is it possible for Redshift to COPY the data from S3 automatically via a manifest, i.e. to COPY the data as soon as new .csv files are added to S3? Docs here and here (the second link: http://docs.aws.amazon.com/redshift/latest/dg/load-from-host-steps-create-manifest.html). Or do I have to create a background worker myself to trigger the COPY commands?
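
For reference, a hedged sketch of a manifest-driven COPY (manifest path, file names, and credentials are placeholders). The manifest only tells a single COPY which explicit list of files to load; the statement itself still has to be issued by something, e.g. the cron/worker mentioned above.

    -- contents of s3://my-clicks-bucket/clicks.manifest (JSON):
    -- {"entries": [
    --   {"url": "s3://my-clicks-bucket/clicks_1408736038.csv", "mandatory": true},
    --   {"url": "s3://my-clicks-bucket/clicks_1408736098.csv", "mandatory": true}
    -- ]}
    COPY clicks
    FROM 's3://my-clicks-bucket/clicks.manifest'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    MANIFEST
    CSV;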

My quick analysis:

In the documentation about consistency, there is no mention of loading data via multi-row inserts. It looks like the preferred way is COPYing from S3 with unique object keys (each .csv on S3 has its own unique name)...

  • S3 Copy strategy:
    • PROS: looks like the good practice from the docs.
    • CONS: more work (I have to manage buckets and manifests and a cron that triggers the COPY commands...).
  • Multi-row insert strategy:
    • PROS: less work; I can call an insert query from my application code.
    • CONS: doesn't look like a standard way of importing data. Am I missing something?

      Answer

      Redshift is an analytical DB, and it is optimized to let you query millions and billions of records. It is also optimized to let you ingest these records very quickly into Redshift using the COPY command.

      The COPY command is designed to load multiple files in parallel into the multiple nodes of the cluster. For example, if you have a cluster of 5 small nodes (dw2.xl), you can copy data up to 10 times faster if your data is split into a number of files (20, for example), since each of these small nodes has 2 slices that can load files in parallel. There is a balance between the number of files and the number of records in each file, as each file has some small overhead.
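
      As a hedged illustration (bucket, key prefix, and credentials are placeholders): if each minute's batch is split into several files sharing a key prefix, a single COPY loads all of them, and the files are divided among the cluster's slices and loaded in parallel.

          -- the batch is split into parts that share one key prefix, e.g.
          --   s3://my-clicks-bucket/batches/clicks_1408736038.part00.csv
          --   s3://my-clicks-bucket/batches/clicks_1408736038.part01.csv
          COPY clicks
          FROM 's3://my-clicks-bucket/batches/clicks_1408736038.part'
          CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
          CSV;
          -- every S3 object whose key starts with the given prefix is loaded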

      This should lead you to a balance between the frequency of the COPY (for example every 5 or 15 minutes, not every 30 seconds) and the size and number of the event files.

      Another point to consider is the two types of Redshift nodes: the SSD ones (dw2.xl and dw2.8xl) and the magnetic ones (dw1.xl and dw1.8xl). The SSD ones are faster in terms of ingestion as well. Since you are looking for very fresh data, you probably prefer to run on the SSD nodes, which are usually the lower-cost option below 500GB of compressed data. If over time you accumulate more than 500GB of compressed data, you can consider running 2 different clusters: one for "hot" data on SSD with the last week or month of data, and one for "cold" data on magnetic disks with all your historical data.

      Lastly, you don't really need to upload the data to S3 at all, which is the major part of your ingestion timing: you can copy the data directly from your servers using the SSH COPY option. See more information about it here: http://docs.aws.amazon.com/redshift/latest/dg/loading-data-from-remote-hosts.html
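
      A hedged sketch of that option (host name, command, username, and manifest location are placeholders; the small SSH manifest still sits on S3, but the click data itself is streamed from the host over SSH):

          -- SSH manifest stored at s3://my-clicks-bucket/ssh_manifest.json, e.g.:
          -- {"entries": [
          --   {"endpoint": "app-server-1.example.com",
          --    "command": "cat /var/clicks/latest_batch.csv",
          --    "mandatory": true,
          --    "username": "redshift"}
          -- ]}
          COPY clicks
          FROM 's3://my-clicks-bucket/ssh_manifest.json'
          CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
          SSH
          CSV;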

      If you are able to split your Redis queue across multiple servers, or at least into multiple queues with different log files, you can probably reach a very good records-per-second ingestion rate.

      Another pattern you may want to consider for near-real-time analytics is Amazon Kinesis, the streaming service. It lets you run analytics on data with a delay of seconds, while at the same time preparing the data for copying into Redshift in a more optimized way.
