Why Is a Block in HDFS So Large?


Problem description


Can somebody explain this calculation and give a lucid explanation?

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
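The quoted calculation can be checked in a few lines of Python. The 10 ms seek time and 100 MB/s transfer rate come directly from the text; the 1% target ratio is the stated design goal:

```python
seek_time = 0.010            # seconds (the text's ~10 ms average seek)
transfer_rate = 100 * 10**6  # bytes per second (the text's 100 MB/s)
target_ratio = 0.01          # seek time should be 1% of transfer time

# We want: seek_time = target_ratio * (block_size / transfer_rate)
# Solving for block_size:
block_size = (seek_time / target_ratio) * transfer_rate
print(block_size / 10**6)  # → 100.0 (MB), matching the text's ~100 MB
```

This is why the computed figure lands near the 64 MB default and the common 128 MB setting, and why it scales up as transfer rates improve while seek times stay roughly flat.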

Solution

A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.

If we keep the ratio seekTime / transferTime small (close to 0.01, as in the text), we are reading data from the disk almost as fast as the physical limit imposed by the disk allows, with minimal time spent locating the information.
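To make the ratio concrete, a small sketch (reusing the text's assumed 10 ms seek and 100 MB/s transfer rate) computes seekTime / transferTime for a few block sizes:

```python
seek_time = 0.010            # seconds, from the text
transfer_rate = 100 * 10**6  # bytes per second, from the text

def seek_overhead(block_size):
    """Ratio of seek time to transfer time for one block read."""
    transfer_time = block_size / transfer_rate
    return seek_time / transfer_time

for mb in (1, 64, 128):
    # roughly 100% overhead at 1 MB, ~1.6% at 64 MB, ~0.8% at 128 MB
    print(mb, seek_overhead(mb * 10**6))
```

With tiny 1 MB blocks, seeking takes as long as reading; at the HDFS defaults the seek cost becomes a rounding error.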

This is important because in MapReduce jobs we typically traverse (read) the whole data set (represented by an HDFS file, folder, or set of folders) and run logic over it. Since we must spend the full transferTime anyway to get all the data off the disk, we try to minimise the time spent seeking by reading in big chunks, hence the large size of the data blocks.
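The full-scan argument can also be sketched numerically. Assuming a hypothetical 1 TB dataset (my number, not the text's), one seek per block, and the text's disk figures, total scan time is the fixed transfer cost plus one seek per block:

```python
seek_time = 0.010            # seconds, from the text
transfer_rate = 100 * 10**6  # bytes per second, from the text
dataset_size = 10**12        # 1 TB, an assumed illustrative dataset size

def scan_time(block_size):
    """Total time to read the whole dataset, one seek per block."""
    n_blocks = dataset_size / block_size
    return n_blocks * (seek_time + block_size / transfer_rate)

# Pure transfer alone costs dataset_size / transfer_rate = 10000 s;
# smaller blocks pile extra seeks on top of that fixed cost.
print(scan_time(1 * 10**6))    # 1 MB blocks: seeks double the scan time
print(scan_time(64 * 10**6))   # 64 MB blocks: seeks add only ~1.6%
```

Whatever the block size, the transfer component is fixed; only the seek component is under our control, and large blocks shrink it.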

In more traditional disk-access software, we typically do not read the whole data set on every access, so we would rather spend more time on many seeks over smaller blocks than waste time transferring large amounts of data we will not need.

