如何处理纱线客户端中运行时间过长的任务(与工作中的其他任务相比)? [英] How to deal with tasks running too long (comparing to others in job) in yarn-client?

查看:110
本文介绍了如何处理纱线客户端中运行时间过长的任务(与工作中的其他任务相比)?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们使用Spark集群作为yarn-client来计算多个业务,但有时我们的任务运行时间过长:

We use a Spark cluster as yarn-client to calculate several business, but sometimes we have a task run too long time:

我们没有设置超时时间,但我认为默认的火花任务超时时间不会太长,例如此处(1.7h).

We don't set timeout but I think default timeout a spark task is not too long such here ( 1.7h ).

任何人都可以帮我解决这个问题?

Anyone give me an ideal to work around this issue ???

推荐答案

如果耗时太长,spark无法终止其任务.

There is no way for spark to kill its tasks if its taking too long.

但是我想出了一种使用推测

But I figured out a way to handle this using speculation,

这意味着,如果一个或多个任务在一个阶段中运行缓慢,则它们 将重新启动.

This means if one or more tasks are running slowly in a stage, they will be re-launched.

spark.speculation                  true
spark.speculation.multiplier       2
spark.speculation.quantile         0

注意:spark.speculation.quantile表示推测"将从您的第一个任务开始.因此,请谨慎使用.我之所以使用它,是因为随着时间的流逝,某些作业会由于GC而变慢.因此,我认为您应该知道何时使用此功能-不是灵丹妙药.

Note: spark.speculation.quantile means the "speculation" will kick in from your first task. So use it with caution. I am using it because some jobs get slowed down due to GC over time. So I think you should know when to use this - its not a silver bullet.

一些相关链接: http://mail-archives.us.apache.org/mod_mbox/spark-user/201506.mbox/%3CCAPmMX=rOVQf7JtDu0uwnp1xNYNyz4xPgXYayKex42AZ_9Pvjug@mail.gmail.com%3E

更新

我找到了解决我的问题的方法(可能不适用于每个人).我为每个任务运行了一堆模拟,所以我增加了运行时的超时.如果模拟花费的时间更长(由于该特定运行的数据偏斜),它将超时.

I found a fix for my issue (might not work for everyone). I had a bunch of simulations running per task, so I added timeout around the run. If a simulation is taking longer (due to a data skew for that specific run), it will timeout.

ExecutorService executor = Executors.newCachedThreadPool();
Callable<SimResult> task = () -> simulator.run();

Future<SimResult> future = executor.submit(task);
try {
    result = future.get(1, TimeUnit.MINUTES);
} catch (TimeoutException ex) {
    future.cancel(true);
    SPARKLOG.info("Task timed out");
}

确保在simulator的主循环中处理中断,例如:

Make sure you handle an interrupt inside the simulator's main loop like:

if(Thread.currentThread().isInterrupted()){
    throw new InterruptedException();
} 

这篇关于如何处理纱线客户端中运行时间过长的任务(与工作中的其他任务相比)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆