如何处理纱线客户端中运行时间过长的任务(与工作中的其他任务相比)? [英] How to deal with tasks running too long (comparing to others in job) in yarn-client?
问题描述
我们使用Spark集群作为yarn-client
来计算多个业务,但有时我们的任务运行时间过长:
We use a Spark cluster as yarn-client
to calculate several business, but sometimes we have a task run too long time:
我们没有设置超时时间,但我认为默认的火花任务超时时间不会太长,例如此处(1.7h).
We don't set timeout but I think default timeout a spark task is not too long such here ( 1.7h ).
任何人都可以帮我解决这个问题?
Anyone give me an ideal to work around this issue ???
推荐答案
如果耗时太长,spark无法终止其任务.
There is no way for spark to kill its tasks if its taking too long.
但是我想出了一种使用推测,
But I figured out a way to handle this using speculation,
这意味着,如果一个或多个任务在一个阶段中运行缓慢,则它们 将重新启动.
This means if one or more tasks are running slowly in a stage, they will be re-launched.
spark.speculation true
spark.speculation.multiplier 2
spark.speculation.quantile 0
注意:spark.speculation.quantile
表示推测"将从您的第一个任务开始.因此,请谨慎使用.我之所以使用它,是因为随着时间的流逝,某些作业会由于GC而变慢.因此,我认为您应该知道何时使用此功能-不是灵丹妙药.
Note: spark.speculation.quantile
means the "speculation" will kick in from your first task. So use it with caution. I am using it because some jobs get slowed down due to GC over time. So I think you should know when to use this - its not a silver bullet.
更新
我找到了解决我的问题的方法(可能不适用于每个人).我为每个任务运行了一堆模拟,所以我增加了运行时的超时.如果模拟花费的时间更长(由于该特定运行的数据偏斜),它将超时.
I found a fix for my issue (might not work for everyone). I had a bunch of simulations running per task, so I added timeout around the run. If a simulation is taking longer (due to a data skew for that specific run), it will timeout.
ExecutorService executor = Executors.newCachedThreadPool();
Callable<SimResult> task = () -> simulator.run();
Future<SimResult> future = executor.submit(task);
try {
result = future.get(1, TimeUnit.MINUTES);
} catch (TimeoutException ex) {
future.cancel(true);
SPARKLOG.info("Task timed out");
}
确保在simulator
的主循环中处理中断,例如:
Make sure you handle an interrupt inside the simulator
's main loop like:
if(Thread.currentThread().isInterrupted()){
throw new InterruptedException();
}
这篇关于如何处理纱线客户端中运行时间过长的任务(与工作中的其他任务相比)?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!