Should I put programs on HDFS?


Question

Should I put programs on HDFS or keep them local? I am talking about a binary file which is (a hypothetical invocation is sketched after this list):

  • launched by spark-submit
  • executed daily
  • runs Spark map/reduce operations on RDDs/DataFrames
  • is a JAR
  • weighs about 20 MB
  • processes a large amount of data, and this data is on HDFS
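For concreteness, here is a minimal sketch of such a daily submission; the paths, class name, and JAR name are hypothetical, not part of the original question. Spark on YARN accepts both a local application JAR and one referenced by an hdfs:// URI:

    # Option A: local JAR, shipped to the cluster on every run
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.DailyJob \
      /opt/jobs/daily-job.jar hdfs:///data/input

    # Option B: JAR already stored on HDFS, referenced by URI
    spark-submit --master yarn --deploy-mode cluster \
      --class com.example.DailyJob \
      hdfs:///apps/daily-job.jar hdfs:///data/input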

I would think it is a bad idea, since distributing an executable file on HDFS might slow down the execution. I think it would be even worse for a file larger than 64 MB (the Hadoop block size). However, I did not find resources about that. Also, I do not know the consequences for memory management (is the Java heap replicated for each node that holds a copy of the JAR?)
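As a side note on the distribution-speed concern, a small, frequently localized file such as a job JAR can be given a higher HDFS replication factor so that many DataNodes hold a copy. This is a common practice, not part of the original question or answer; the paths are hypothetical:

    # Upload the JAR once, then raise its replication factor so that
    # localization reads do not bottleneck on a single DataNode.
    hdfs dfs -mkdir -p /apps
    hdfs dfs -put daily-job.jar /apps/daily-job.jar
    hdfs dfs -setrep -w 10 /apps/daily-job.jar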

Answer

Yes, this is exactly the concept behind YARN's shared cache.

The main reason for doing this is when you have a large amount of resources tied to your jobs: submitting them as local resources on every run wastes network bandwidth.
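As a rough sketch of what enabling it involves, the property names below come from the Hadoop shared cache documentation, while the paths and the job itself are hypothetical:

    # Enable the shared cache in yarn-site.xml (values are illustrative):
    #   yarn.sharedcache.enabled  = true
    #   yarn.sharedcache.root-dir = /sharedcache
    # Then start the SharedCacheManager daemon:
    yarn sharedcachemanager

    # Opt a MapReduce job in, so its JAR is uploaded once, checksummed,
    # and reused by later runs instead of being re-shipped each time
    # (assumes the main class parses generic options via ToolRunner):
    hadoop jar daily-job.jar com.example.DailyJob \
      -Dmapreduce.job.sharedcache.mode="jobjar,libjars" \
      /data/input /data/output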

Refer to the Slideshare deck to understand the performance impacts in more detail:

  • Slideshare: Hadoop Summit 2015: A Secure Public Cache For YARN Application Resources
  • Hadoop documentation: YARN Shared Cache
  • YARN-1492: truly shared cache for jars (jobjar/libjar)
