Why is the RDD not persisted in memory for every iteration in Spark?


Question

I use Spark for a machine learning application. Spark and Hadoop share the same compute cluster, without any resource manager such as YARN, so Hadoop jobs can run while a Spark task is running.

But the machine learning application runs very slowly. I found that on every iteration, some workers need to re-add some RDD blocks to memory, like this:

243413 14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)
243414 14/07/23 13:30:07 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_17 in memory on XXX:48238 (size: 118.3 MB, free: 16.2 GB)
243415 14/07/23 13:30:08 INFO BlockManagerMasterActor$BlockManagerInfo: Added rdd_2_19 in memory on TS-XXX:48238 (size: 119.0 MB, free: 16.1 GB)

So I think the recomputation needed to reload the RDDs is what makes the application so slow.

My question is: why was the RDD not persisted in memory when there was enough free memory? Is it because of the Hadoop jobs?

I added the following JVM parameters: -Xmx10g -Xms10g
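Note that -Xmx/-Xms only size the heap of whichever JVM they are passed to. In standalone mode, the heap available to executors for caching RDD blocks is normally configured through Spark's own settings instead. A hypothetical spark-defaults.conf sketch (the values here are illustrative, not tuned for this cluster):

```
# Illustrative only -- adjust to your machines' actual memory.
spark.executor.memory        10g
# Fraction of the executor heap reserved for cached RDD blocks
# (spark.storage.memoryFraction, default 0.6 in Spark of this era).
spark.storage.memoryFraction 0.6
```

If cached blocks exceed this storage fraction, Spark evicts them and must recompute the partitions later, which would produce repeated "Added rdd_..." log lines like the ones above.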

After that there were fewer RDD "Added" events than before, and the task run times were shorter. But the total time for one stage is still too large. From the web UI, I found that:

For every stage, the workers do not all start at the same time. For example, only after worker_1 had finished 10 tasks did worker_2 appear on the web UI and start its tasks. This leads to a long stage time.

Our Spark cluster works in standalone mode.

Answer

It is hard to say what is wrong with your job, but here are some hints.

First, you can try calling persist() on intermediate RDDs to mark that you want them cached. Second, Spark automatically stores the results of shuffle operations on disk at each node, so perhaps the problem is not in caching at all.
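The effect of persist() comes from Spark's lazy evaluation: without it, the whole lineage of an RDD is recomputed every time an action runs. The following is a toy sketch in plain Python (not Spark itself; `LazyDataset`, `collect`, and `cache` are hypothetical stand-ins for an RDD and its API) that illustrates why an uncached lazy dataset is recomputed on every iteration:

```python
compute_count = 0  # counts how many times the transform actually runs

def expensive_transform(x):
    """Stand-in for a costly per-record computation."""
    global compute_count
    compute_count += 1
    return x * 2

class LazyDataset:
    """Toy lazy collection: collect() recomputes unless cache() was called."""
    def __init__(self, source, fn):
        self._source = list(source)
        self._fn = fn
        self._cached = None

    def collect(self):
        if self._cached is not None:
            return self._cached          # reuse materialized result
        return [self._fn(x) for x in self._source]  # recompute from scratch

    def cache(self):
        # Materialize once and keep in memory, like rdd.persist() in Spark.
        self._cached = [self._fn(x) for x in self._source]
        return self

ds = LazyDataset(range(1000), expensive_transform)
for _ in range(3):
    ds.collect()
uncached_runs = compute_count            # 3000: recomputed each iteration

compute_count = 0
ds.cache()
for _ in range(3):
    ds.collect()
cached_runs = compute_count              # 1000: computed once, then reused

print(uncached_runs, cached_runs)
```

In real Spark the analogous call is `rdd.persist()` (or `rdd.cache()`) before the iterative loop; iterative ML workloads like this one are exactly the case those calls are designed for.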

You can find some additional information here:

  • RDD Persistence
  • Tuning Spark

