Data locality if HDFS not used


Problem Description

What happens to the data locality feature of the Map/Reduce portion of Hadoop when you provide it with storage other than HDFS, such as a MySQL server? In other words, my understanding is that Hadoop Map/Reduce uses data locality to try to launch a map task on the same node where the data resides, but when the data is stored in a SQL server, there is no local data on the task node, as all the data is on the SQL server node. So do we lose data locality in that case, or does the definition of data locality change? If it changes, what is the new definition?

Solution

There is no data locality if the data is not in the cluster; all of the data must be copied from the remote source. This is the same situation as when a task cannot be run on a node that holds its data in HDFS. There are several input formats that use remote sources, including S3, HBase, and DB. If you can put your data in HDFS, that is great. I use Mongo as a remote source quite regularly for small amounts of frequently updated data, and I have been happy with the results.
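
To make the DB route concrete, here is a minimal, untested sketch using Hadoop's org.apache.hadoop.mapreduce.lib.db package (DBInputFormat), assuming a hypothetical MySQL table users(id, name); the JDBC URL, credentials, driver, and table are placeholders, not anything from the original answer. Every map task fetches its split of rows over JDBC from the single database node, which is exactly why there is no locality to exploit:

```java
// A minimal sketch of feeding map input from MySQL through Hadoop's
// DBInputFormat instead of HDFS. The driver, JDBC URL, credentials,
// and the users(id, name) table are hypothetical placeholders.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DbSourceJob {

  // One row of the table; DBInputFormat needs a DBWritable to map
  // JDBC columns to fields, and Hadoop needs a Writable to move it.
  public static class UserRow implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      name = rs.getString("name");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  public static class RowMapper
      extends Mapper<LongWritable, UserRow, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, UserRow row, Context ctx)
        throws IOException, InterruptedException {
      // Each record arrived over JDBC from the database node --
      // there is no local block to read, hence no data locality.
      ctx.write(new Text(row.name), new LongWritable(row.id));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical connection details; substitute your own.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost:3306/appdb", "user", "secret");

    Job job = Job.getInstance(conf, "read-from-mysql");
    job.setJarByClass(DbSourceJob.class);
    job.setMapperClass(RowMapper.class);
    job.setNumReduceTasks(0); // map-only, just to show the input path
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    job.setInputFormatClass(DBInputFormat.class);
    // SELECT id, name FROM users, split into row ranges; every range
    // is fetched remotely from the single MySQL server.
    DBInputFormat.setInput(job, UserRow.class,
        "users", null, "id", "id", "name");

    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each split is just a row range pulled over the network, adding more map tasks adds more concurrent connections to the one database, not more local reads.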

