Distributed Cache and performance Hadoop

This article looks at the Hadoop distributed cache and its performance impact; the question and answer below may be a useful reference for anyone with the same doubts.

Problem Description

I want to make my understanding of the Hadoop distributed cache clear. I know that when we add files to the distributed cache, the files get loaded onto the disk of every node in the cluster.

So how does the data in those files get transmitted to all the nodes in the cluster? Is it over the network? If so, won't that put a strain on the network?

I have the following thoughts; are they correct?



If the files are large, won't there be network congestion?



If the number of nodes is large, then even if the files are of medium or small size, won't replicating the files and transmitting them to all the nodes cause network congestion and memory pressure?



Please help me understand these concepts.

Thanks!!!

Solution


  1. Yes, the files are transferred over the network, usually via HDFS. It puts no more strain on the network than using HDFS for any other non-data-local task.

  2. If the files are large, there is a possibility of network congestion, but you are already pushing your job jar to all of those task trackers, so as long as your files are not much bigger than your jar, the overhead should not be too bad.


  3. The replication of the files is completely separate from the number of task trackers that will eventually pull them. Replication is chained from node to node and is simply the cost of having a fault-tolerant distributed file system, no matter what. Again, assuming the files in the distributed cache are roughly the same size as your jars, the network congestion is no more of a problem than pushing your jar to all of the task trackers. (A sketch of one related tuning knob, the replication factor of the cached file, follows this list.)


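If a single cached file is going to be pulled by a large number of task trackers, one related knob is the HDFS replication factor of that file: a higher factor spreads the reads over more datanodes. A minimal sketch using the standard FileSystem API; the path and the factor of 10 are made-up values, not anything prescribed by the answer above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseCacheReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file that will later be added to the distributed cache.
            Path cached = new Path("/cache/lookup.dat");

            // Raise the replication factor so that many task trackers reading
            // the file do not all hit the same few datanodes.
            fs.setReplication(cached, (short) 10);
        }
    }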

    Overall, the overhead of the distributed cache is minuscule as long as it is used as intended: as a way to push reasonably small cached data so that it is local to the task trackers doing the computation.
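For reference, here is a minimal sketch of that intended usage with the 0.20-era API referenced in the edit below: the driver registers a small HDFS file with the DistributedCache, and each mapper reads the node-local copy in setup(). The namenode address, paths, and class names are hypothetical:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheExample {

        public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void setup(Context context) throws IOException {
                // The cached files have already been copied to this node's local
                // disk; getLocalCacheFiles returns their local paths.
                Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                if (local != null && local.length > 0) {
                    BufferedReader reader = new BufferedReader(new FileReader(local[0].toString()));
                    // ... load the small lookup data into memory ...
                    reader.close();
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Register the cache file before constructing the Job, which copies the configuration.
            DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/cache/lookup.dat"), conf);

            Job job = new Job(conf, "distributed cache example");
            job.setJarByClass(CacheExample.class);
            job.setMapperClass(CacheMapper.class);
            // ... input/output paths and formats omitted ...
            job.waitForCompletion(true);
        }
    }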



    Edit: here is the DistributedCache documentation for 0.20. Note that the files are specified via URLs; usually you would use something on your local hdfs:// setup.
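As those docs note, cache entries are plain URIs. One detail worth knowing: a URI fragment can name a symlink in each task's working directory, which is often more convenient than looking up the local path. A small, hedged addition to the driver sketch above (same hypothetical conf and paths):

    // The #lookup fragment names a symlink created in each task's working directory.
    DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/cache/lookup.dat#lookup"), conf);
    // Ask the framework to actually create the symlinks for this job.
    DistributedCache.createSymlink(conf);
    // Tasks can then simply open new File("lookup").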



This concludes the question and answer on Distributed Cache and performance in Hadoop; hopefully the answer above is helpful.
