火花图(功能).缓存慢 [英] spark map(func).cache slow

查看:56
本文介绍了火花图(功能).缓存慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

当我使用cache存储数据时,我发现spark的运行速度非常慢.但是,当我不使用cache方法时,速度非常好.我的主要配置文件如下:

When I use the cache to store data,I found that spark is running very slow. However, when I don't use cache Method,the speed is very good.My main profile is follows:

SPARK_JAVA_OPTS+="-Dspark.local.dir=/home/wangchao/hadoop-yarn-spark/tmp_out_info 
-Dspark.rdd.compress=true -Dspark.storage.memoryFraction=0.4 
-Dspark.shuffle.spill=false -Dspark.executor.memory=1800m -Dspark.akka.frameSize=100 
-Dspark.default.parallelism=6"

我的测试代码是:

val file = sc.textFile("hdfs://10.168.9.240:9000/user/bailin/filename")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).cache()..reduceByKey(_+_)
count.collect()

对于我如何解决此问题的任何答案或建议,我们将不胜感激.

Any answers or suggestions on how I can resolve this are greatly appreciated.

推荐答案

cache在您正在使用的上下文中无用.在这种情况下,cache表示将映射结果.map(word => (word, 1))保存在内存中.而如果您不调用它,则可将reducer链接到地图的末尾,并在使用地图结果后将其丢弃.在创建后在RDD上调用多个转换/动作的情况下,最好使用cache.例如,如果您创建一个数据集并希望加入两个不同的数据集,则对它进行缓存会很有帮助,因为如果您不进行第二次加入,则会重新计算整个RDD.这是Spark网站上的一个容易理解的示例.

cache is useless in the context you are using it. In this situation cache is saying save the result of the map, .map(word => (word, 1)) in memory. Whereas if you didn't call it the reducer could be chained to the end of the map and the maps results discarded after they are used. cache is better used in a situation where multiple transformations/actions will be called on the RDD after it is created. For example if you create a data set you want to join to 2 different datasets it is helpful to cache it, because if you don't on the second join the whole RDD will be recalculated. Here is an easily understandable example from spark's website.

val file = spark.textFile("hdfs://...")
val errors = file.filter(line => line.contains("ERROR")).cache() //errors is cached to prevent recalculation when the two filters are called
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()

内部缓存所做的事情是通过将RDD的祖先保存在内存中/保存到磁盘(取决于存储级别)来删除它的祖先,RDD必须保存其祖先的原因是可以按需重新计算它,是RDD的恢复方法.

What cache is doing internally is removing the ancestors of an RDD by keeping it in memory/saving to disk(depending on the storage level), the reason an RDD must save its ancestors is so it can be recalculated on demand, this is the recovery method of RDD's.

这篇关于火花图(功能).缓存慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆