How to set the data block size in Hadoop? Is it advantageous to change it?


Problem description

If we can change the data block size in Hadoop, please let me know how to do that. Is it advantageous to change the block size? If yes, then let me know why and how; if not, then let me know why.

Solution

There seems to be much confusion about this topic, and also some wrong advice going around. To clear up the confusion, it helps to think about how HDFS is actually implemented:

HDFS is an abstraction over the distributed, disk-based filesystems underneath it, so the words "block" and "blocksize" have a different meaning than generally understood. For HDFS, a "file" is just a collection of blocks, and each "block" is in turn stored as an actual file on a datanode. In fact the same block file is stored on several datanodes, according to the replication factor. The blocksize of these individual files, and their other performance characteristics, in turn depend on the underlying filesystems of the individual datanodes.

The mapping between an HDFS file and the individual files on the datanodes is maintained by the namenode. But the namenode doesn't expect a specific blocksize; it just stores the mappings that were created when the HDFS file was written, where the file is usually split according to the default dfs.blocksize (but this can be overridden per file).
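Since the question also asks how to change it, here is a minimal sketch of the usual knobs, assuming Hadoop 2.x or later where the property is called dfs.blocksize (older releases used dfs.block.size). The cluster-wide default goes into hdfs-site.xml, the same property can typically be passed per command (e.g. hdfs dfs -D dfs.blocksize=268435456 -put ...), and the Java API allows a per-file value; the path and sizes below are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // assumes fs.defaultFS points at your HDFS cluster

        // Client-side default for files created through this Configuration;
        // cluster-wide you would put the same property into hdfs-site.xml.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);    // 256 MB

        FileSystem fs = FileSystem.get(conf);

        // Per-file override: this FileSystem.create overload takes an explicit
        // block size, so a single file can deviate from the configured default.
        Path file = new Path("/tmp/example.dat");             // placeholder path
        short replication = 3;
        long blockSize = 512L * 1024 * 1024;                  // 512 MB, for this file only
        try (FSDataOutputStream out = fs.create(file, true,
                conf.getInt("io.file.buffer.size", 4096), replication, blockSize)) {
            out.writeUTF("hello");
        }
    }
}
```

Note that the blocksize is fixed when a file is written: changing the default later only affects newly created files, and existing files keep the blocksize they were created with.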

This means, for example, that if you have a 1 MB file with a replication factor of 3 and a blocksize of 64 MB, you don't lose 63 MB * 3 = 189 MB, since physically just three 1 MB files are stored, using the standard blocksize of the underlying filesystems (e.g. ext4).
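If you want to verify this on your own cluster, a small sketch (hypothetical path; it prints the nominal blocksize the file was created with versus the actual length of each stored block, which for a 1 MB file is just 1 MB):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockUsageCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/tmp/one-mb-file.dat")); // placeholder path

        // Nominal blocksize the file was created with (e.g. 64 MB)...
        System.out.println("nominal blocksize: " + st.getBlockSize());

        // ...versus the real length of each block: for a 1 MB file there is
        // a single block of roughly 1 MB, replicated to several hosts.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println("block length: " + b.getLength()
                    + " on " + String.join(",", b.getHosts()));
        }
    }
}
```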

So the question becomes what a good dfs.blocksize is and whether it's advisable to change it. Let me first list the arguments for a bigger blocksize:

  1. Namenode pressure: As mentioned, the namenode has to maintain the mapping from HDFS files and their blocks to the physical files on the datanodes. So the fewer blocks per file, the less memory pressure and communication overhead it has.
  2. Disk throughput: Files are written by a single process in Hadoop, which usually results in data being written sequentially to disk. This is especially advantageous for rotational disks because it avoids costly seeks. If the data is written that way, it can also be read that way, so it becomes an advantage for both reads and writes. In fact this optimization, in combination with data locality (i.e. doing the processing where the data is), is one of the main ideas of MapReduce.
  3. Network throughput: Data locality is the more important optimization, but in a distributed system it cannot always be achieved, so sometimes it's necessary to copy data between nodes. Normally one file (dfs block) is transferred via one persistent TCP connection, which reaches a higher throughput when big files are transferred.
  4. Bigger default splits: Even though the splitsize can be configured at the job level, most people don't consider this and just go with the default, which is usually the blocksize. If your splitsize is too small, though, you can end up with too many mappers that don't have much work to do, which in turn can lead to even smaller output files, unnecessary overhead, and many occupied containers that can starve other jobs. This also has an adverse effect on the reduce phase, since the results must be fetched from all mappers.

    Of course the ideal splitsize heavily depends on the kind of work you have to do. But you can always set a lower splitsize when necessary, whereas if you set a splitsize higher than the blocksize you might lose some data locality.

    The latter aspect is less of an issue than one would think, though, because the rule for block placement in HDFS is: the first replica is written on the datanode where the process creating the file runs, the second on another node in the same rack, and the third on a node in another rack. So usually one replica of each block of a file can be found on a single datanode, and data locality can still be achieved even when one mapper is reading several blocks due to a splitsize that is a multiple of the blocksize. Still, in this case the MapReduce framework can only select one node instead of the usual three to achieve data locality, so the effect can't be denied entirely.

    But ultimately this argument for a bigger blocksize is probably the weakest of all, since the splitsize can be set independently when necessary.
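For reference, a sketch of what setting the splitsize independently of the blocksize looks like for a plain MapReduce job using the standard FileInputFormat family; the job name and the 128 MB / 512 MB bounds are arbitrary examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo"); // placeholder job name

        // Lower bound: don't create splits smaller than 128 MB,
        // even where the blocks are smaller.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

        // Upper bound: cap splits at 512 MB; a split spanning several blocks
        // may cost some data locality, as discussed above.
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        // The equivalent configuration keys (Hadoop 2.x names) are
        // mapreduce.input.fileinputformat.split.minsize and
        // mapreduce.input.fileinputformat.split.maxsize.
        return job;
    }
}
```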

But there must also be arguments for a smaller blocksize, otherwise we would just set it to infinity…

  1. Parallelism/Distribution: If your input data lies on just a few nodes, even a big cluster doesn't help to achieve parallel processing, at least if you want to maintain some data locality. As a rule, I would say a good blocksize should match what you can also accept as a splitsize for your default workload.
  2. Fault tolerance and latency: If a network connection breaks, retransmitting a smaller file causes less disruption. TCP throughput might be important, but individual transfers shouldn't take forever either.

How you weigh these factors against each other depends on your kind of data, cluster, workload, etc. But in general I think the default blocksize of 128 MB is already a little low for typical use cases; 512 MB or even 1 GB might be worth considering.

But before you even dig into that, you should first check the size of your input files. If most of your files are small and don't even reach the default blocksize, your blocksize is effectively always the filesize, and increasing the default blocksize wouldn't help at all. There are workarounds, like using an input combiner to avoid spawning too many mappers, but ultimately you need to ensure your input files are big enough to take advantage of a big blocksize.
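The input-combiner workaround mentioned above usually means something along the lines of CombineTextInputFormat, which packs many small files into each split instead of starting one mapper per file; a rough sketch with an illustrative 256 MB target split size:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesJob {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "small-files-demo"); // placeholder job name

        // Pack many small files into each split instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Aim for splits of roughly 256 MB, regardless of how small the inputs are.
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        return job;
    }
}
```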

And if your files are already small, don't compound the problem by making the blocksize even smaller.
