Why Is a Block in HDFS So Large?


Problem Description

Can somebody explain this calculation and give a lucid explanation?

A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
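A quick sketch of the arithmetic quoted above, using the excerpt's own numbers (10 ms seek, 100 MB/s transfer, 1% target):

```python
# Sketch of the quoted calculation, using the excerpt's numbers.
seek_time = 0.010        # seconds (~10 ms average seek)
transfer_rate = 100e6    # bytes/second (100 MB/s sustained transfer)
target_ratio = 0.01      # seek time should be ~1% of transfer time

# transfer_time = block_size / transfer_rate, and we want
# seek_time / transfer_time == target_ratio, so:
block_size = seek_time * transfer_rate / target_ratio
print(f"block size ~= {block_size / 1e6:.0f} MB")   # -> 100 MB
```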

Answer

A block will be stored as a contiguous piece of information on the disk, which means that the total time to read it completely is the time to locate it (seek time) + the time to read its content without doing any more seeks, i.e. sizeOfTheBlock / transferRate = transferTime.

If we keep the ratio seekTime / transferTime small (close to .01 in the text), it means we are reading data from the disk almost as fast as the physical limit imposed by the disk, with minimal time spent looking for information.
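To see how that ratio behaves, here is a small illustrative comparison (not from the original answer) of the seek overhead for a few block sizes, using the same 10 ms / 100 MB/s figures:

```python
# Illustrative only: fraction of a block read lost to the initial seek,
# for several block sizes, with the same 10 ms / 100 MB/s figures.
seek_time = 0.010
transfer_rate = 100e6

for block_size in (4 * 1024, 4 * 1024**2, 64 * 1024**2, 128 * 1024**2):
    transfer_time = block_size / transfer_rate
    seek_fraction = seek_time / (seek_time + transfer_time)
    print(f"{block_size / 1024**2:9.3f} MB block: "
          f"{seek_fraction:6.1%} of the read time is seek")
```

With 4 KB blocks virtually all of the read time is seek; at 64 MB the overhead drops to roughly 1.5%, and at 128 MB to under 1%.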

This is important since in MapReduce jobs we are typically traversing (reading) the whole data set (represented by an HDFS file, folder, or set of folders) and processing it. Since we have to spend the full transferTime anyway to get all the data off the disk, we try to minimise the time spent seeking by reading in big chunks, hence the large size of the data blocks.
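As a back-of-the-envelope illustration of that argument (the 1 TiB dataset size is an assumption, not from the original answer), consider scanning a whole dataset block by block, charging one seek per block:

```python
# Assumed, illustrative numbers: time to scan a 1 TiB dataset block by
# block, charging one 10 ms seek per block at 100 MB/s transfer.
dataset_size = 1024**4   # 1 TiB in bytes
seek_time = 0.010
transfer_rate = 100e6

for block_size in (4 * 1024, 128 * 1024**2):   # 4 KB vs 128 MB blocks
    n_blocks = dataset_size / block_size
    total_seek = n_blocks * seek_time
    total_transfer = dataset_size / transfer_rate
    print(f"{block_size / 1024**2:8.3f} MB blocks: "
          f"{(total_seek + total_transfer) / 3600:8.1f} h total "
          f"({total_seek / 3600:.2f} h of seeking)")
```

With filesystem-sized 4 KB blocks the scan spends hundreds of hours seeking; with HDFS-sized 128 MB blocks it runs in about three hours, close to the disk's raw transfer rate.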

In more traditional disk-access software, we typically do not read the whole data set every time, so we would rather spend more time doing plenty of seeks on smaller blocks than waste time transferring data we will not need.

