Results of workers not returned properly - snow - debug

Question

I'm using the snow package in R to execute a function on a SOCK cluster of three machines running Linux. I tried running the code with both parLapply and clusterApply.

If an error occurs on a worker, the worker's results are not returned properly to the master, which makes debugging very hard. I'm currently logging every heartbeat of the worker nodes independently using futile.logger, and the results appear to be computed correctly. But when I try to print the result on the master node (after receiving the output from the workers), I get this error: Error in checkForRemoteErrors(val): 8 nodes produced errors; first error: missing value where TRUE/FALSE needed

Is there any way to debug the workers' results in more depth?

Recommended answer

parLapply and clusterApply call the checkForRemoteErrors function to check for task errors, and it throws an error if any of the tasks failed. Unfortunately, although it displays the error message, it doesn't tell you which worker code caused the error. But if you modify your worker/task function to catch errors, you can retain some extra information that may help you determine where the error occurred.

For example, here's a simple snow program that fails. Note that it uses outfile='' when creating the cluster so that output from the workers is displayed, which is a very useful debugging technique in its own right:

library(snow)
cl <- makeSOCKcluster(2, outfile='')
problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}
r <- parLapply(cl, 1:2, problem)

When you execute this, you see the error message from checkForRemoteErrors and some other messages, but nothing that tells you that the if statement caused the error. To catch errors when calling problem, we define workerfun, which prints the error on the worker and then rethrows it:

workerfun <- function(i) {
  tryCatch({
    problem(i)
  },
  error=function(e) {
    print(e)
    stop(e)
  })
}

Now we run workerfun with parLapply instead of problem, after first exporting problem to the workers:

clusterExport(cl, c('problem'))
r <- parLapply(cl, 1:2, workerfun)

Among the other messages, we now see:

<simpleError in if (NA) j <- 999 else j <- i: missing value where TRUE/FALSE needed>

This includes the actual if statement that generated the error. Of course, it doesn't tell you the file name and line number of the expression, but it's often enough to let you solve the problem.
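If printing on the worker isn't convenient, a variation on the same idea is to return the error details to the master as ordinary data instead of rethrowing. The following is only a sketch: safe_task is a hypothetical helper (not part of snow), and it is exercised here with plain lapply so it runs without a cluster. The assumption is that the wrapped function could be passed to parLapply in the same way as workerfun above.

```r
# Hypothetical helper (not part of snow): wrap a task function so that an
# error is returned as a list describing it rather than aborting the call.
safe_task <- function(f) {
  force(f)  # evaluate f now so the closure carries the function itself
  function(...) {
    stack <- NULL
    tryCatch(
      withCallingHandlers(
        list(ok = TRUE, value = f(...)),
        # Runs while the failing frames still exist, so record the call stack.
        error = function(e) stack <<- sys.calls()
      ),
      error = function(e) {
        list(ok = FALSE,
             message = conditionMessage(e),
             call = paste(deparse(conditionCall(e)), collapse = " "),
             stack = vapply(as.list(stack),
                            function(cl) paste(deparse(cl), collapse = " "),
                            character(1)))
      }
    )
  }
}

# Same failing task as in the example above.
problem <- function(i) {
  if (NA)
    j <- 999
  else
    j <- i
  2 * j
}

# Run locally; every task fails, but lapply still returns one entry per task.
res <- lapply(1:2, safe_task(problem))
failed <- Filter(function(r) !r$ok, res)
print(failed[[1]]$message)
print(failed[[1]]$call)
```

Because nothing is rethrown, checkForRemoteErrors would have no error to report, so the master has to check the ok field of each result itself. The stack field holds the deparsed calls that were active when the error was raised, which can narrow down where a failure occurred inside a deeper function.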
