Confusion about distributed cache in Hadoop


Question

What does the distributed cache actually mean? Does having a file in the distributed cache mean it is available on every datanode, so there is no inter-node communication for that data, or does it mean the file is held in memory on every node? If not, by what means can I have a file in memory for the entire job? Can this be done both for map-reduce and for a UDF?

(In particular, there is some comparatively small configuration data that I would like to keep in memory while a UDF applies to a Hive query...?)

Thanks and regards,
Dhruv Kapur

Solution

DistributedCache is a facility provided by the Map-Reduce framework to cache files needed by applications. Once you cache a file for your job, the Hadoop framework makes it available on each data node where your map/reduce tasks run (on the local file system, not in memory). You can then access the cached file as a local file from your Mapper or Reducer, read it, and populate a collection (e.g. an array or HashMap) in memory in your code.
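As a concrete illustration, here is a minimal sketch of that pattern using the Hadoop 2.x mapreduce API; the HDFS path, symlink name, and class name are made up for the example. The "#lookup" fragment on the cache URI makes the framework create a symlink named "lookup" in each task's working directory, pointing at the locally cached copy:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    // In the driver, register the file before submitting the job:
    //   job.addCacheFile(new URI("/user/dhruv/config/lookup.txt#lookup"));

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file sits on the task node's local disk; it only
        // becomes "in memory" once we read it into this map.
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Join each input line against the in-memory lookup table.
        String enriched = lookup.getOrDefault(value.toString(), "UNKNOWN");
        context.write(value, new Text(enriched));
    }
}
```

Loading the file once in setup() rather than in map() matters: setup() runs once per task, so the lookup table is built a single time instead of once per input record.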

Refer to https://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/filecache/DistributedCache.html



Let me know if you still have questions.



You can read the cached file as a local file in your UDF code. After reading it with the Java file APIs, simply populate any collection in memory.
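For the Hive case, a common pattern (the one the post linked below describes) is to ship the file with ADD FILE and lazily load it on the first call to the UDF. A minimal sketch, with a hypothetical file name and class name:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class LookupUDF extends UDF {

    private Map<String, String> lookup; // built once per task JVM

    public Text evaluate(Text key) throws IOException {
        if (key == null) {
            return null;
        }
        if (lookup == null) {
            lookup = new HashMap<>();
            // Files shipped with "ADD FILE" appear in the task's working
            // directory under their base name ("lookup.txt" here).
            try (BufferedReader reader =
                    new BufferedReader(new FileReader("lookup.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }
        String value = lookup.get(key.toString());
        return value == null ? null : new Text(value);
    }
}
```

In the Hive session you would then run ADD FILE /local/path/lookup.txt; and CREATE TEMPORARY FUNCTION my_lookup AS 'LookupUDF'; before using the function in a query. ADD FILE places the file in the distributed cache, which is what makes it readable as a plain local file inside the UDF.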



Reference: http://www.lichun.cc/blog/2013/06/use-a-lookup-hashmap-in-hive-script/



-Ashish


