AWS EMR performance: HDFS vs S3


Question

In Big Data, the code is pushed to the data for execution. This makes sense, since the data is huge and the code to execute is relatively small. On AWS EMR, the data can be in either HDFS or S3. With S3, the data has to be pulled from other nodes to the core/task nodes for execution. This might add some overhead compared to data in HDFS.

Recently, I noticed that while an MR job was executing there was huge latency in getting the log files into S3. Sometimes it took a couple of minutes for the log files to appear, even after the job had completed.

Any thoughts on this? Does anyone have metrics for MR job completion times with the data in HDFS vs S3?

Recommended answer

That's problematic on a different level.

S3 has only eventual consistency. Because the write process is delayed, you cannot immediately see or read something after your code has written it (e.g. after a close() or flush()). I think this might be due to the allocation of free resources for the data you write. So it is not a problem of performance, but of the consistency you really want or need.
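If a downstream step has to read an object that was just written, one way to cope with the delay described above is to poll until the object becomes visible. The following is only a minimal sketch: `wait_until_visible` is a hypothetical helper, not part of any AWS library, and the `exists` callable is an assumption standing in for whatever check you actually use (for example, a wrapper around an S3 "does this key exist" request).

```python
import time
from typing import Callable


def wait_until_visible(
    exists: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `exists()` until it returns True or `timeout_s` elapses.

    `exists` is a caller-supplied check (hypothetical here); it is
    always called at least once, even with a zero timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if exists():
            return True   # the object is now readable
        if time.monotonic() >= deadline:
            return False  # gave up: still not visible
        sleep(interval_s)
```

Polling only hides the delay; it does not make S3 suitable as a buffer between jobs, which is why the approach below keeps working data in HDFS.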

What do I do on EMR? I start up a Hadoop cluster and put everything the job(s) need into HDFS. Reads take much more time on S3, and the eventual consistency makes it basically useless for buffering items between jobs.
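As a rough illustration of staging input into HDFS before the jobs run, the sketch below builds an `s3-dist-cp` command line (the copy tool that ships with EMR) and optionally runs it. The answer does not say which tool was used, so this is an assumption; the bucket name and paths are placeholders, and `build_s3_dist_cp` and `stage_to_hdfs` are hypothetical helper names.

```python
import subprocess
from typing import List


def build_s3_dist_cp(src: str, dest: str) -> List[str]:
    """Return the argument list for copying `src` to `dest` with s3-dist-cp."""
    return ["s3-dist-cp", "--src", src, "--dest", dest]


def stage_to_hdfs(src: str, dest: str) -> None:
    """Run the copy; only meaningful on an EMR node where s3-dist-cp is installed."""
    subprocess.run(build_s3_dist_cp(src, dest), check=True)


# Example (placeholder bucket and paths), to be run on the cluster:
# stage_to_hdfs("s3://example-bucket/input/", "hdfs:///input/")
```

The same helper works in the other direction (an `hdfs:///` source and an `s3://` destination) for copying results out once the jobs have finished.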

However, S3 is great for backing up files from your HDFS or for making them available to other instances or services (e.g. CloudFront).
