Hadoop sequential data access

Question

According to Hadoop: The Definitive Guide:

  HDFS is a filesystem designed for storing very large files with streaming or sequential data access patterns

What is streaming or sequential data access? How does it reduce the disk's seek time?

Solution

This is not really specific to Hadoop.

A sequential access pattern is one where you read your data in order, often from start to finish. Take a book as an example. When reading a novel, you read sequentially: you start with page 1, then move to page 2, and so on. The other common pattern is called random access. This is when you jump from one place to another, possibly even backwards, while reading data. For the book example, consider a dictionary. You don't read it like a novel. Instead, you open it somewhere in the middle to find your word, and when you're done, the next word you look up may be hundreds of pages away from where the book is currently open. Finding the place where you should start reading is called a "seek".

When you access data sequentially, you only need to seek once and then read until you're done with that data. With random access, you need to seek every time you switch to a different place in the file. This can be quite a performance hit on hard drives, because seeking is expensive on magnetic disks: the drive head has to physically move to the right track and wait for the platter to rotate into position before any data can be read.
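
To make the difference concrete, here is a minimal Python sketch that reads the same file both ways and counts the seeks it issues. The block size, block count, and function names are all made up for illustration. Note that it counts logical `seek()` calls on a file object, not physical head movements, and that the operating system's cache will hide most of the real cost on a file this small.

```python
import os
import random
import tempfile

BLOCK_SIZE = 4096   # hypothetical block size, chosen only for illustration
NUM_BLOCKS = 64     # hypothetical file length in blocks


def make_test_file(path):
    """Write NUM_BLOCKS blocks; each block is filled with its own index byte."""
    with open(path, "wb") as f:
        for i in range(NUM_BLOCKS):
            f.write(bytes([i]) * BLOCK_SIZE)


def read_sequential(path):
    """Seek once to the start, then read block after block until EOF."""
    seeks = 0
    blocks = []
    with open(path, "rb") as f:
        f.seek(0)
        seeks += 1
        while True:
            chunk = f.read(BLOCK_SIZE)
            if not chunk:
                break
            blocks.append(chunk[0])  # remember which block we just read
    return blocks, seeks


def read_random(path, order):
    """Seek to each requested block before reading it."""
    seeks = 0
    blocks = []
    with open(path, "rb") as f:
        for i in order:
            f.seek(i * BLOCK_SIZE)   # one seek per block visited
            seeks += 1
            blocks.append(f.read(BLOCK_SIZE)[0])
    return blocks, seeks


def demo():
    """Build a temporary file, read it both ways, and return the results."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        make_test_file(path)
        order = list(range(NUM_BLOCKS))
        random.Random(0).shuffle(order)  # fixed seed so runs are repeatable
        seq_blocks, seq_seeks = read_sequential(path)
        rnd_blocks, rnd_seeks = read_random(path, order)
        return seq_blocks, seq_seeks, rnd_blocks, rnd_seeks, order
    finally:
        os.remove(path)


if __name__ == "__main__":
    _, seq_seeks, _, rnd_seeks, _ = demo()
    print("sequential seeks:", seq_seeks)
    print("random seeks:", rnd_seeks)
```

Both reads return the same data, but the sequential one issues a single seek while the random one issues one per block. On a magnetic disk, each of those extra seeks can mean a physical head movement, which is why a workload that streams through large files is cheaper than one that jumps around.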
