Why doesn't the Hadoop file system support random I/O?


Problem description

Distributed file systems like the Google File System and Hadoop don't support random I/O.

(A file that has already been written cannot be modified; only writing and appending are possible.)

Why did they design the file system this way? What are the important advantages of this design?

P.S. I know Hadoop will support modifying data that has already been written, but they say its performance would be very poor. Why?
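For context, this is the contract the Hadoop FileSystem Java API exposes; the sketch below (hypothetical path, default client configuration, assuming a reachable HDFS cluster) shows that the only write paths are create() and append(), while random reads via seek() on the input stream are fine:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnlyDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/demo.log");   // hypothetical path

            // Writing: create() always starts a new file (here, overwriting any old one).
            try (FSDataOutputStream out = fs.create(p, true)) {
                out.writeBytes("first record\n");
            }

            // Appending: append() can only add bytes at the current end of the file.
            try (FSDataOutputStream out = fs.append(p)) {
                out.writeBytes("second record\n");
            }

            // Random reads are supported: the input stream is seekable.
            try (FSDataInputStream in = fs.open(p)) {
                in.seek(6);                        // jump into the middle of the file
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n));
            }

            // There is no seek on the output stream, so bytes that were already
            // written cannot be overwritten in place.
        }
    }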

Solution

Hadoop distributes and replicates files. Because files are replicated, any write operation would have to find every replica of the affected data across the network and update it, which greatly increases the time the operation takes. Updating a file could also push it over the block size, requiring the file to be split into two blocks and the second block to be replicated. I don't know the internals of when/how a block would be split, but it's a potential complication.
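To make that cost concrete, here is a minimal sketch (hypothetical path, default client configuration, assuming a reachable HDFS cluster) that uses the FileSystem API to list a file's replication factor and the datanodes holding each of its blocks; an in-place write at some offset would have to be applied consistently to every one of those replicas:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayoutDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/data/big.csv");   // hypothetical file

            FileStatus status = fs.getFileStatus(p);
            System.out.println("replication factor: " + status.getReplication());
            System.out.println("block size (bytes): " + status.getBlockSize());

            // One entry per block; each block lists the datanodes holding a replica.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + ", length " + b.getLength()
                        + ", hosts " + String.join(",", b.getHosts()));
            }
        }
    }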



What if a job that had already performed an update fails or is killed and is then re-run? It could end up updating the file multiple times.



The advantage of not updating files in a distributed system is that when you update a file you don't know who else is using it, and you don't know where its pieces are stored. There are potential timeouts (the node holding a block may be unresponsive), so you might end up with mismatched data. (Again, I don't know Hadoop's internals; an update with a node down might well be handled, this is just something I'm brainstorming.)

There are a lot of potential issues with updating files on HDFS (a few are laid out above). None of them is insurmountable, but checking for and accounting for them would come with a performance hit.

Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.


