HDFS replication factor

Question

When I upload a file to HDFS with the replication factor set to 1, will the file's splits reside on one single machine, or will the splits be distributed across multiple machines in the network?

hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit

Solution

According to Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
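
To make the quoted policy concrete, here is a deliberately simplified sketch of that decision logic. This is not Hadoop's actual implementation (the real logic lives inside the NameNode's default block placement policy); the class, the "rack/host" node encoding, and every name below are hypothetical and exist only to mirror the description above.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    class PlacementSketch {
        private final Random random = new Random();

        // Nodes are encoded as "rack/host" strings purely for this sketch.
        List<String> chooseTargets(String client, List<String> nodes, int replication) {
            List<String> targets = new ArrayList<>();

            // 1st replica: the client's own node if the client runs inside the
            // cluster, otherwise a node chosen at random.
            String first = nodes.contains(client)
                    ? client
                    : nodes.get(random.nextInt(nodes.size()));
            targets.add(first);
            if (replication == 1) return targets;   // dfs.replication=1 stops here

            // 2nd replica: a random node on a different rack than the first (off-rack).
            String second = pick(nodes, n -> !rackOf(n).equals(rackOf(first)));
            targets.add(second);
            if (replication == 2) return targets;

            // 3rd replica: same rack as the second, but a different node.
            targets.add(pick(nodes, n -> rackOf(n).equals(rackOf(second)) && !n.equals(second)));

            // Further replicas: random nodes, trying not to pile up on one rack (omitted).
            return targets;
        }

        private String rackOf(String node) { return node.split("/")[0]; }

        private String pick(List<String> nodes, Predicate<String> ok) {
            List<String> candidates = nodes.stream().filter(ok).collect(Collectors.toList());
            return candidates.get(random.nextInt(candidates.size()));
        }
    }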

This logic makes sense, as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.

I think it depends on whether the client is itself a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on the same node. This doesn't provide any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster. Now, this provides better read/write throughput.
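
Rather than reasoning about this in the abstract, you can ask the NameNode where each block of the uploaded file actually landed. Below is a minimal sketch using the standard FileSystem#getFileBlockLocations API; it assumes the Configuration on the classpath points at your cluster and reuses the path from the question.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreMyBlocks {
        public static void main(String[] args) throws Exception {
            // Assumes fs.defaultFS in the classpath configuration points at the cluster.
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/ablimit/file.txt"));

            // One BlockLocation per block; with replication=1, getHosts() reports
            // a single DataNode for each block.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset() + " -> "
                        + String.join(", ", block.getHosts()));
            }
        }
    }

The same information is available from the command line with hdfs fsck /user/ablimit/file.txt -files -blocks -locations.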

One advantage of writing to multiple nodes is that even if one of the nodes goes down, only a couple of splits might be lost, and at least some data can still be recovered from the remaining splits.
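
If a single copy later feels too fragile, the replication factor can be raised per file without rewriting it, and the NameNode schedules the extra copies in the background. A minimal sketch using the standard FileSystem#setReplication call, under the same cluster assumptions as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Command-line equivalent: hadoop fs -setrep 3 /user/ablimit/file.txt
            boolean scheduled = fs.setReplication(new Path("/user/ablimit/file.txt"), (short) 3);
            System.out.println("re-replication scheduled: " + scheduled);
        }
    }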
