Linux Server crash in R parallel - Error in unserialize(node$con) : error reading from connection


Problem description

I am running code in R within a Linux cluster - the code is complex (over two thousand lines of code), involves over 40 R packages and several hundred variables. However, it does run on both the Windows and Linux versions of R.

I am now running the code on the Edinburgh University EDCF high performance computing cluster, where it runs in parallel. The parallel code is called within DEoptim, which basically, after some initialization, runs a series of functions in parallel; the results are sent back to the DEoptim algorithm as well as being saved as a plot and data table in my own space - and importantly the code runs and works!

The code models the hydrology of a region and I can set it to simulate historic conditions over any time period I want - from one day to 30 years. For one month run in parallel, results are spat out approximately every 70 seconds, and the DEoptim algorithm simply keeps re-running the code with different input parameters, trying to find the best parameter set.

The code seems to run fine for a number of runs but eventually crashes. Last night it completed over 100 runs with no problem over approximately 2 hours but eventually crashed - and it always eventually crashes - with the error:

Error in unserialize(node$con) : error reading from connection

The system I am logging onto is a 16 core server (16 true cores) according to:

detectCores()

and I requested 8 slots of 2GB memory each. I have tried running this on a 24 core machine with a large memory request (4 slots of 40GB memory) but it still eventually crashes. This code ran fine for several weeks on a Windows machine, spitting out thousands of results while running in parallel across 8 logical cores.

So I believe the code is okay, but why is it crashing? Could it be a memory issue? Each time the sequence is called it includes:

rm(list=ls())
gc()

Or is it simply a core crashing? I did think at one point that it could be a problem if two cores were trying to write to the same data file at the same time, but I removed that temporarily and it still crashed. Sometimes it crashes after a few minutes and other times after a couple of hours. I have tried removing one core from the parallel code using:

cl <- parallel::makeCluster(parallel::detectCores()-1)

but it still crashes.

Is there any way that the code could be modified so it rejects crashed outputs, e.g. if error then reject and carry on?
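
For illustration, a minimal sketch of the kind of error handling being asked about: wrap the per-worker model call in tryCatch(). Here run_model() and the penalty value of 1e6 are placeholders, not names from the actual code.

# Sketch: return a large (bad) objective value instead of letting an R-level
# error propagate out of the worker. run_model() is a hypothetical stand-in
# for the hydrology model call evaluated by DEoptim.
safe_objective <- function(params) {
  tryCatch(
    run_model(params),
    error = function(e) {
      message("worker error: ", conditionMessage(e))
      1e6  # penalty value so DEoptim rejects this parameter set and carries on
    }
  )
}

Note that tryCatch() only catches R-level errors inside a worker; if the worker process itself dies (for example from running out of memory), the master will still see the unserialize(node$con) error.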

Or, is there a way of modifying the code to catch why the error happened at all?
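
One way to get more information, sketched below assuming a PSOCK cluster created with makeCluster(): redirect the workers' output and error messages to a log file with the outfile argument, so whatever the workers printed before the connection died is preserved. The log file name is an assumption.

library(parallel)

# Redirect each worker's stdout/stderr to a log file instead of discarding it;
# "worker_log.txt" is just an example path on writable storage.
cl <- makeCluster(detectCores() - 1, outfile = "worker_log.txt")

# ... run the DEoptim / parallel code as before ...

stopCluster(cl)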

I know there are lots of other serialize(node$con) and unserialize(node$con) error posts but they don't seem to help me.

I'd really appreciate some help.

Thanks.

Answer

I had a similar problem running parallel code that depended on several other packages. Try using foreach() with %dopar% and specify the packages your code depends on with the .packages option to load them onto each worker. Alternatively, judicious use of require() within the parallel code may also work.
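
As a rough sketch of that pattern (the package names and run_model() below are placeholders, not the packages or functions from the question):

library(foreach)
library(doParallel)

cl <- parallel::makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# .packages loads the listed packages on every worker before the loop body runs;
# run_model() stands in for the per-parameter-set model evaluation.
results <- foreach(i = 1:100,
                   .combine = rbind,
                   .packages = c("zoo", "xts")) %dopar% {
  run_model(i)
}

parallel::stopCluster(cl)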
