Wrong process getting killed on other node?

Problem description

I wrote a simple program ("controller") to run some computation on a separate node ("worker"). The reason being that if the worker node runs out of memory, the controller still works:

-module(controller).
-compile(export_all).

p(Msg,Args) -> io:format("~p " ++ Msg, [time() | Args]).

progress_monitor(P,N) ->
    timer:sleep(5*60*1000),
    p("killing the worker which was using strategy #~p~n", [N]),
    exit(P, took_to_long).

start() ->
    start(1).
start(Strat) ->
    P = spawn('worker@localhost', worker, start, [Strat,self(),60000000000]),
    p("starting worker using strategy #~p~n", [Strat]),
    spawn(controller,progress_monitor,[P,Strat]),
    monitor(process, P),
    receive
        {'DOWN', _, _, P, Info} ->
            p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
        X ->
            p("got result: ~p~n", [X])
    end,
    case Strat of
        4 -> p("out of strategies. giving up~n", []);
        _ -> timer:sleep(5000), % wait for node to come back
             start(Strat + 1)
    end.

To test it, I deliberately wrote 3 factorial implementations that will use up lots of memory and crash, and a fourth implementation which uses tail recursion to avoid taking too much space:

-module(worker).
-compile(export_all).

start(1,P,N) -> P ! factorial1(N);
start(2,P,N) -> P ! factorial2(N);
start(3,P,N) -> P ! factorial3(N);
start(4,P,N) -> P ! factorial4(N,1).

factorial1(0) -> 1;
factorial1(N) -> N*factorial1(N-1).

factorial2(N) ->
    case N of
        0 -> 1;
        _ -> N*factorial2(N-1)
    end.

factorial3(N) -> lists:foldl(fun(X,Y) -> X*Y end, 1, lists:seq(1,N)).

factorial4(0, A) -> A;
factorial4(N, A) -> factorial4(N-1, A*N).

Note that even with the tail-recursive version (factorial4), I'm calling it with 60000000000, which will probably take days to finish on my machine. Here is the output of running the controller:

$ erl -sname 'controller@localhost'
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(controller@localhost)1> c(worker).
{ok,worker}
(controller@localhost)2> c(controller).
{ok,controller}
(controller@localhost)3> controller:start().
{23,24,28} starting worker using strategy #1
{23,25,13} worker using strategy #1 died. reason: noconnection
{23,25,18} starting worker using strategy #2
{23,26,2} worker using strategy #2 died. reason: noconnection
{23,26,7} starting worker using strategy #3
{23,26,40} worker using strategy #3 died. reason: noconnection
{23,26,45} starting worker using strategy #4
{23,29,28} killing the worker which was using strategy #1
{23,29,29} worker using strategy #4 died. reason: took_to_long
{23,29,29} out of strategies. giving up
ok

It almost works, but worker #4 was killed too early (it should have been killed close to 23:31:45, not 23:29:29). Looking deeper, a kill was only ever attempted against worker #1 and no other worker, so worker #4 should not have died, yet it did. Why? We can even see that the reason was took_to_long, and that progress_monitor #1 started at 23:24:28, five minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?

Here is the output of the worker when I ran the controller:

$ while true; do erl -sname 'worker@localhost'; done
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 

Solution

There are several issues, and ultimately you ran into creation-number wraparound.

Since you never cancel the progress_monitor process, it always sends an exit signal after 5 minutes.
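
One way to fix this (a minimal sketch of mine, not code from the question) is to make progress_monitor cancellable: wait for a cancel message instead of sleeping unconditionally, and have the controller cancel it as soon as the worker's result or 'DOWN' message arrives:

%% Sketch: a cancellable progress monitor. It kills the worker only
%% if no 'cancel' message arrives within 5 minutes.
progress_monitor(P, N) ->
    receive
        cancel ->
            ok  % worker finished or died in time; nothing to do
    after 5*60*1000 ->
        p("killing the worker which was using strategy #~p~n", [N]),
        exit(P, took_to_long)
    end.

In start/1 the controller would then keep the monitor's pid and cancel it right after its receive block:

    Mon = spawn(controller, progress_monitor, [P, Strat]),
    monitor(process, P),
    receive ... end,
    Mon ! cancel,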

The computation is long and/or the VM is slow, so worker 4 is still running 5 minutes after the progress monitor for worker 1 was started.

The 4 worker nodes were started sequentially with the same name worker@localhost, and the creation numbers of the first and the fourth node are the same.

Creation numbers (the creation field in references and pids) are a mechanism to prevent pids and references created by a crashed node from being interpreted by a new node with the same name. This is exactly what your code relies on when it tries to kill worker 1 after its node is long gone: you don't intend to kill a process in a restarted node.

When a node sends a pid or a reference, it encodes its creation number in it. When it receives a pid or a reference from another node, it checks that the creation number in the pid matches its own creation number. Creation numbers are assigned by epmd, following the sequence 1, 2, 3.
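
You can observe the creation number directly (an illustrative transcript; erlang:system_info(creation) returns the local node's creation number, and the exact values depend on how often the node name has been reused with epmd):

$ erl -sname 'worker@localhost'
(worker@localhost)1> erlang:system_info(creation).
1

Restart the node under the same name and the number advances to 2, then 3, and then wraps back to 1, which is the heart of the problem here.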

Here, unfortunately, when the 4th node gets the exit signal, the creation number matches because the sequence wrapped around. And since each node incarnation did exactly the same things before spawning the worker (booting and initializing Erlang), the pid of the worker on node 4 matches the pid of the worker on node 1.
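
A quick experiment makes the deterministic pid assignment plausible (illustrative; the exact pid number varies between ERTS versions, but it repeats across identical restarts):

$ erl -sname 'worker@localhost'
(worker@localhost)1> spawn(fun() -> ok end).
<0.39.0>

Restarting the node and evaluating the same line yields the same pid again, so a pid number combined with a wrapped creation number no longer identifies a unique process.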

As a result, the controller eventually kills worker 4 believing it is worker 1.

To avoid this, you need something more robust than the creation number whenever there can be 4 worker incarnations within the lifespan of a pid or a reference held by the controller.
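
One sketch of such a mechanism (my suggestion, not part of the original answer) is to let the progress monitor itself monitor the worker. A monitor reference is tied to the original process incarnation, so if the worker or its node dies before the timeout, the monitor receives 'DOWN' and never sends an exit signal that a recycled pid could receive:

%% Sketch: progress_monitor that monitors the worker it guards.
%% If the worker (or its node) dies before the timeout, 'DOWN'
%% arrives here and no exit signal is ever sent, so a later node
%% incarnation that reuses the same pid cannot be killed by mistake.
progress_monitor(P, N) ->
    Ref = monitor(process, P),
    receive
        {'DOWN', Ref, process, P, _Reason} ->
            ok  % worker already gone; nothing to kill
    after 5*60*1000 ->
        demonitor(Ref, [flush]),
        p("killing the worker which was using strategy #~p~n", [N]),
        exit(P, took_to_long)
    end.

This variant also subsumes the cancellation fix sketched above, because a worker that finishes normally triggers 'DOWN' as well.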
