S3 and EMR data locality

Views: 120 Published: 2020/6/4 0:50:32 Tags: amazon-web-services hadoop amazon-s3 amazon-ec2 amazon-emr


Problem description

Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:

EMR + S3

The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.

What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled to HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?

What comforts me is that I'll be pulling the data only the first time; all subsequent jobs will then have the intermediate results locally.
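To put rough numbers on that reasoning, here is a minimal back-of-the-envelope sketch. Everything in it is an assumption for illustration: the function names, the dataset size, the node count, and the per-node throughput are hypothetical, not measurements of S3 or EMR. It also assumes nodes pull in parallel and that throughput scales linearly with node count, which may not hold in practice.

```python
def pull_time_seconds(dataset_gb, nodes, per_node_mb_per_s):
    """Estimate the time to copy a dataset from S3 into HDFS once.

    Assumes every node pulls in parallel and aggregate throughput
    scales linearly with the number of nodes (an idealization).
    """
    aggregate_mb_per_s = nodes * per_node_mb_per_s
    return dataset_gb * 1024 / aggregate_mb_per_s


def amortized_overhead_seconds(pull_seconds, jobs):
    """Spread the one-time pull cost over the jobs that reuse local data."""
    return pull_seconds / jobs


if __name__ == "__main__":
    # Hypothetical figures, chosen only to make the arithmetic easy to follow.
    pull = pull_time_seconds(dataset_gb=1000, nodes=10, per_node_mb_per_s=100)
    per_job = amortized_overhead_seconds(pull, jobs=8)
    print(f"one-time pull: {pull:.0f} s, amortized per job: {per_job:.0f} s")
```

If the data really is pulled only once, the overhead per job shrinks as more jobs reuse the local copy; if every job has to pull it again, the full pull time is paid each time.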

I'm hoping for an answer from someone with practical experience with this. Thank you.
