S3 and EMR data locality


Question

Data locality with MapReduce and HDFS is very important (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:


  • EC2

  • EMR + S3

The second option seems more appealing for several reasons. The most interesting is the ability to scale storage and processing separately and to shut processing down when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.

What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled to HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?

What comforts me is that I'll be pulling the data only the first time, and all subsequent jobs will have the intermediate results locally.

I'm hoping for an answer from someone with practical experience with this. Thank you.

Recommended answer

EMR does not pull data from S3 to HDFS. It uses its own implementation of the Hadoop file system interface on top of S3 (EMRFS), so jobs read and write S3 as if they were operating on an actual HDFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
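To make the point above concrete, here is a minimal sketch of a job that reads its input in place from S3 rather than copying it to HDFS first. It assumes a PySpark runtime on an EMR cluster, where the `s3://` scheme is handled by EMRFS; the bucket name, prefix, and all three helper functions are hypothetical names introduced for illustration only.

```python
from urllib.parse import urlparse


def input_uri(bucket: str, prefix: str) -> str:
    """Build an s3:// URI for a dataset (hypothetical helper)."""
    return f"s3://{bucket}/{prefix.lstrip('/')}"


def storage_backend(uri: str) -> str:
    """Name the storage layer a URI points at, based only on its scheme."""
    scheme = urlparse(uri).scheme
    return {"s3": "S3 via EMRFS", "hdfs": "cluster HDFS"}.get(scheme, "unknown")


def count_lines(uri: str) -> int:
    """Read a text dataset directly from the given URI and count its lines.

    Needs a Spark runtime; on EMR an s3:// URI is read in place,
    with no prior copy step into HDFS.
    """
    # Imported lazily so the helpers above work without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-direct-read").getOrCreate()
    return spark.read.text(uri).count()
```

On a cluster, `count_lines(input_uri("example-bucket", "logs/"))` would scan the objects under that prefix directly. Passing an `hdfs://` URI instead is the only change needed to read intermediate results kept on the cluster.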

As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.
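For context on what RACK_LOCAL means here: Spark's scheduler ranks task locality as PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY, from closest to farthest. The sketch below models that ordering as a plain Python enum. It is an illustration only, not part of Spark's API, and the `Locality` class and `is_node_local_or_better` helper are hypothetical names.

```python
from enum import IntEnum


class Locality(IntEnum):
    """Spark's task locality levels, ordered from closest to farthest."""
    PROCESS_LOCAL = 0  # data is in the same executor JVM
    NODE_LOCAL = 1     # data is on the same node
    NO_PREF = 2        # no locality preference
    RACK_LOCAL = 3     # data is on the same rack, reached over the network
    ANY = 4            # data is elsewhere on the network


def is_node_local_or_better(level: Locality) -> bool:
    """True when the task runs on the node that holds its data."""
    return level <= Locality.NODE_LOCAL
```

Under this ordering, a RACK_LOCAL read is not node-local: the data crosses the network, which is consistent with S3 being a storage service separate from the cluster nodes.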
