Should I put programs on HDFS?
Question
Should I put programs on HDFS or keep them local? I am talking about a binary file which:
- is launched by spark-submit
- is executed daily
- performs Spark map/reduce operations on RDDs/DataFrames
- is a JAR
- weighs about 20 MB
- processes a large amount of data, and this data is on HDFS
I would think it is a bad idea, since distributing an executable file over HDFS might slow down the execution. I think it would be even worse for a file larger than 64 MB (the Hadoop block size). However, I did not find resources about that. Also, I do not know the consequences for memory management (is the Java heap replicated for each node that holds a copy of the JAR?)
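For context, the daily launch described above would look roughly like this. This is only a sketch: the JAR path, the main class `com.example.DailyJob`, and the input path are hypothetical placeholders, not names from the question.

```shell
# Daily job launched via spark-submit; the ~20 MB JAR lives on the local
# filesystem of the submitting host and is shipped to the cluster on each run.
# (job.jar, com.example.DailyJob, and the HDFS input path are hypothetical.)
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.DailyJob \
  /opt/jobs/job.jar \
  hdfs:///data/input/
```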
Answer
Yes, this is exactly the concept behind YARN's shared cache.
The main reason for doing this is that if you have a large number of resources tied to your jobs, submitting them as local resources on every run wastes network bandwidth.
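As a concrete illustration, one common way to get this benefit with Spark on YARN is to keep the JAR on HDFS and pass its hdfs:// path to spark-submit; YARN then localizes the file to the worker nodes via its distributed cache instead of uploading it from the client on every submission. This is a sketch under that assumption; the paths and class name are hypothetical.

```shell
# Upload the JAR to HDFS once (re-upload only when it changes).
hdfs dfs -mkdir -p /apps/jobs
hdfs dfs -put -f job.jar /apps/jobs/job.jar

# Reference it by its HDFS path; YARN localizes it on the nodes rather
# than shipping it from the submitting host on each run.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.DailyJob \
  hdfs:///apps/jobs/job.jar
```

Note that YARN also caches localized resources per node, so an unchanged JAR is typically downloaded once per node rather than once per job.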
Refer to the Slideshare to understand the performance impacts in more detail:
- Slideshare: Hadoop Summit 2015: A Secure Public Cache For YARN Application Resources
- YARN Shared Cache
- YARN-1492 truly shared cache for jars (jobjar/libjar)