What does "local caching of data" mean in the context of this article?


Problem description


The following paragraphs of text (from http://developer.yahoo.com/hadoop/tutorial/module2.html) mention that large, sequentially readable files are not suitable for local caching, but I don't understand what "local" means here...

There are two possibilities in my opinion: one is that the client caches data from HDFS, and the other is that the datanode caches HDFS data in its local filesystem or memory for clients to access quickly. Can anyone explain more? Thanks a lot.


But while HDFS is very scalable, its high performance design also restricts it to a particular class of applications; it is not as general-purpose as NFS. There are a large number of additional decisions and trade-offs that were made with HDFS. In particular:

Applications that use HDFS are assumed to perform long sequential streaming reads from files. HDFS is optimized to provide streaming read performance; this comes at the expense of random seek times to arbitrary positions in files.
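As a rough illustration (not part of the quoted tutorial), here is a minimal sketch of the access pattern HDFS favors, using the standard Hadoop FileSystem API; the path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/large-input.txt"); // hypothetical path

        byte[] buffer = new byte[64 * 1024];
        try (FSDataInputStream in = fs.open(path)) {
            // The case HDFS is optimized for: one long sequential pass.
            int n;
            while ((n = in.read(buffer)) > 0) {
                // ... process buffer[0..n) ...
            }
            // in.seek(offset) is available, but random seeks are the
            // expensive case the tutorial trades away.
        }
    }
}
```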

Data will be written to the HDFS once and then read several times; updates to files after they have already been closed are not supported. (An extension to Hadoop will provide support for appending new data to the ends of files; it is scheduled to be included in Hadoop 0.19 but is not available yet.)
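A minimal sketch of the write-once model with the same API (path again hypothetical); note that at the time of the tutorial the append extension was not yet available:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnce {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/data/write-once.txt"); // hypothetical path

        // Write the file once; after close() its contents are fixed.
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("written exactly once\n");
        }
        // There is no API to rewrite a closed file in place;
        // fs.append(path) only arrived in a later Hadoop release.
    }
}
```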

Due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data. The overhead of caching is great enough that data should simply be re-read from HDFS source.

Individual machines are assumed to fail on a frequent basis, both permanently and intermittently. The cluster must be able to withstand the complete failure of several machines, possibly many happening at the same time (e.g., if a rack fails all together). While performance may degrade proportional to the number of machines lost, the system as a whole should not become overly slow, nor should information be lost. Data replication strategies combat this problem.
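Replication is exposed through configuration and the FileSystem API; a sketch under those assumptions (the factors and path are just examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default replication factor (3 is the usual default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Or raise the factor for a single hot file (path hypothetical).
        fs.setReplication(new Path("/data/hot-input.txt"), (short) 5);
    }
}
```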


Solution

Any real MapReduce job is probably going to process GBs (10s/100s/1000s) of data from HDFS.

Therefore any one mapper instance is most probably going to be processing a fair amount of data (typical block size is 64/128/256 MB depending on your configuration) in a sequential manner (it will read the file / block in its entirety from start to end).
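As a sketch of that pattern (a hypothetical pass-through mapper, not code from the answer): the framework hands each mapper one input split, typically one HDFS block, and streams through it exactly once.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for illustration only: the framework calls map()
// once per record of the mapper's single input split, in order, so the
// underlying HDFS block is read sequentially exactly once.
public class PassThroughMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(offset, line); // handle each record as it streams by
    }
}
```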

It is also unlikely that another mapper instance running on the same machine will want to process that data block again any time in the near future. More likely, several other mapper instances will be processing other data alongside this mapper in any one TaskTracker (hopefully with a fair few of them being 'local' to the actual physical location of their data, i.e. a replica of the data block also exists on the same machine the mapper instance is running on).

With all this in mind, caching the data read from HDFS is probably not going to gain you much: you'll most probably not get a cache hit on that data before another block is queried and ultimately evicts it from the cache. (For example, a 100 GB input at a 128 MB block size is roughly 800 distinct blocks, so any worker-sized cache would be churned through long before the same block came around again.)

