Hadoop DistributedCache functionality in Spark


Question


I am looking for functionality similar to Hadoop's distributed cache in Spark. I need a relatively small data file (with some index values) to be present on all nodes in order to perform some calculations. Is there any approach that makes this possible in Spark?


My workaround so far consists of distributing and reducing the index file as a normal processing step, which takes around 10 seconds in my application. After that, I persist the file as a broadcast variable, as follows:

JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
// Pull the small index down to the driver...
ArrayList<String> localIndex = (ArrayList<String>) indexFile.collect();

// ...and ship it to every executor as a broadcast variable
final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);


This makes the content of the variable globalIndex available to the program on every node. So far this patch works for me, but I don't consider it the best solution. Would it still be effective with a considerably bigger data set or a large number of variables?
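For context, the broadcast value is read back on the executors through globalIndex.value(). A minimal sketch of that usage (dataRdd is a hypothetical placeholder for whatever RDD the computation runs over):

import org.apache.spark.api.java.function.Function;

// Hypothetical usage sketch: dataRdd stands in for the RDD being processed.
JavaRDD<String> enriched = dataRdd.map(new Function<String, String>() {
    @Override
    public String call(String record) {
        // Each executor reads its local copy of the broadcast index
        ArrayList<String> index = globalIndex.value();
        // ... look the record up against the index and compute ...
        return record;
    }
});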


Note: I am using Spark 1.0.0 running on a standalone cluster spread across several EC2 instances.

Answer


Have a look at the SparkContext.addFile() method. I guess that is what you are looking for.
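As a rough sketch of how that could look (someRdd and record are placeholders, not names from the question): addFile() ships the file to every node once, and SparkFiles.get() resolves the local copy on the executor side.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.function.Function;

// Ship the small index file to every node once
ctx.addFile("s3n://mybucket/input/indexFile.txt");

JavaRDD<String> result = someRdd.map(new Function<String, String>() {
    @Override
    public String call(String record) throws Exception {
        // On the executor, resolve the local path of the distributed file
        String localPath = SparkFiles.get("indexFile.txt");
        List<String> index = Files.readAllLines(Paths.get(localPath), StandardCharsets.UTF_8);
        // ... use the index to process the record ...
        return record;
    }
});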
