Data locality if HDFS not used
Problem description
What happens to the data locality feature of the Map/Reduce portion of Hadoop when you back it with storage other than HDFS, such as a MySQL server? In other words, my understanding is that Hadoop Map/Reduce uses data locality to try to launch a map task on the same node where the data resides, but when the data is stored in a SQL server there is no local data on the task node, since all the data lives on the SQL server node. So do we lose data locality in that case, or does the definition of data locality change? If it changes, what is the new definition?
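As background to the question: locality in MapReduce is driven by the hosts that each InputSplit advertises through getLocations(). The following is a minimal sketch, not the actual Hadoop source, and the host names are made up; it contrasts an HDFS-style split, which can name the datanodes holding its block, with a remote-source split, which cannot:

```java
import org.apache.hadoop.mapreduce.InputSplit;

public class LocalitySketch {

    // HDFS-style split: it can report the datanodes that store its block,
    // so the scheduler can try to place the map task on one of them.
    static class HdfsLikeSplit extends InputSplit {
        @Override
        public long getLength() {
            return 128L * 1024 * 1024; // one 128 MB block
        }
        @Override
        public String[] getLocations() {
            // Hypothetical datanode host names, for illustration only.
            return new String[] { "datanode-1", "datanode-2", "datanode-3" };
        }
    }

    // Remote-source split (e.g. a range of rows in a SQL server): no worker
    // node holds a local copy, so it advertises no hosts and the scheduler
    // has no locality preference to exploit.
    static class RemoteSourceSplit extends InputSplit {
        @Override
        public long getLength() {
            return 0; // length is often unknown for remote sources
        }
        @Override
        public String[] getLocations() {
            return new String[0];
        }
    }
}
```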
Recommended answer

There is no data locality if the data is not in the cluster: all of it must be copied from the remote source. This is the same situation as when a task cannot run on a node that holds its data in HDFS. There are several input formats that read from remote sources, including S3, HBase and DB. If you can put your data in HDFS, that is great. I use Mongo as a remote source quite regularly for small amounts of frequently updated data, and I have been happy with the results.
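As a concrete illustration of the "DB" input format the answer mentions, here is a minimal sketch of a job driver that reads from a remote MySQL server through Hadoop's DBInputFormat. The JDBC URL, credentials, table name and column names are placeholders rather than values from the original post; every split is fetched over the network, so no map task sees local data:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class RemoteSqlJob {

    // Record type mapping one row of a hypothetical "events" table.
    public static class EventRecord implements Writable, DBWritable {
        int id;
        String payload;

        // How a row is read from the JDBC ResultSet.
        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getInt("id");
            payload = rs.getString("payload");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setInt(1, id);
            ps.setString(2, payload);
        }
        // Hadoop serialization between tasks.
        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            payload = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeUTF(payload);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder driver, URL and credentials for a remote MySQL server.
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://db-host:3306/mydb",
                "user", "password");

        Job job = Job.getInstance(conf, "read-from-mysql");
        job.setJarByClass(RemoteSqlJob.class);
        job.setInputFormatClass(DBInputFormat.class);
        // Splits are ranges of rows ordered by "id"; each map task pulls
        // its range over the network from the single database host.
        DBInputFormat.setInput(job, EventRecord.class,
                "events",          // table name
                null,              // WHERE conditions
                "id",              // ORDER BY column used to split
                "id", "payload");  // columns to read
        // ... set mapper, reducer and output format as usual, then:
        // System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This pattern fits the answer's advice: since every record crosses the network, it works best for small, frequently updated datasets, as the answerer does with Mongo, rather than for bulk data that belongs in HDFS.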