Starting a process from a bash script fails

Question

I have a central server on which cron periodically starts a script that checks remote servers. The checks run serially: first one server, then the next, and so on.

This script (on the central server) starts another script on the remote machine (let's call it update.sh), and that script does something like this:

processID=`pgrep "processName"` 
kill $processID
startProcess.sh

The process is killed and then restarted by startProcess.sh, like this:

pidof "processName"

if [ ! $? -eq 0 ]; then
    nohup "processName" "processArgs" >> "processLog" &
    pidof "processName"
    if [ ! $? -eq 0 ]; then
        echo "Error: failed to start process"
...

update.sh, startProcess.sh and the actual binary of the process they start all live on an NFS share mounted from the central server.

What sometimes happens is that the process I try to start in startProcess.sh does not start, and I get the error. The strange part is that it is random: on the same machine the process starts one time and fails to start another time. I'm checking about 300 servers, and the errors are always random.

One more thing: the remote servers are in 3 different geographic locations (2 in America and 1 in Europe), and the central server is in Europe. From what I have found so far, the servers in America produce many more errors than those in Europe.

At first I thought the error had something to do with kill, so I added a sleep between the kill and startProcess.sh, but that made no difference.

It also seems that the process from startProcess.sh is either not started at all or something happens to it right as it starts, because there is no output in the log file, and there should be.

So here I am, asking for help.

Has anybody had this kind of problem, or does anybody know what might be wrong?

Thanks for your help.

Recommended answer

(Sorry, but my original answer was wrong... here is the correction.)

Using $? in startProcess.sh to get the exit status of the background process gives the wrong result. The bash man page states:

Special Parameters
?      Expands to the status of the most recently executed foreground
       pipeline.
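
To make the quoted rule concrete, here is a minimal sketch (not taken from the original scripts) of what $? reports right after a command is put in the background. The variable name launch_status is only illustrative, and false stands in for a process that fails.

```shell
#!/bin/bash
# `false` always exits with status 1, but here it is started in the background.
false &

# $? now holds the status of launching the asynchronous command, which bash
# sets to 0; it says nothing about how `false` itself ended.
launch_status=$?
echo "status right after '&': $launch_status"

# Reap the child so the sketch leaves nothing behind.
wait $!
```

The same reasoning applies to the check in startProcess.sh when $? is read immediately after the nohup ... & line: it cannot tell whether the started process survived.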

As you mentioned in your comment, the proper way to get a background process's exit status is the wait built-in. For this to work, bash has to process the SIGCHLD signal.
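
As a minimal sketch of the wait built-in on its own, assuming a short-lived job: the subshell below is a hypothetical stand-in for the real process and is made to fail with status 3, and the names pid and bg_status are illustrative.

```shell
#!/bin/bash
# Hypothetical stand-in for the real process: runs briefly, then fails with 3.
( sleep 1; exit 3 ) &
pid=$!            # PID of the most recently started background job

# wait blocks until that PID has exited and returns its exit status.
wait "$pid"
bg_status=$?
echo "background job $pid exited with status $bg_status"
```

Note that wait blocks until the child exits, so for a long-running daemon this form is only useful for noticing that it terminated, not for confirming that it is still running.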

I made a small test environment to show how this can work:

Here is a script, loop.sh, to run as a background process:

#!/bin/bash
[ "$1" == -x ] && exit 1;
cnt=${1:-500}
while ((++c<=cnt)); do echo "SLEEPING [$$]: $c/$cnt"; sleep 5; done

If the argument is -x, it exits with status 1 to simulate an error. If the argument is a number num, it runs for num*5 seconds, printing SLEEPING [<PID>]: <counter>/<max_counter> to stdout.

The second is the launcher script. It starts 3 loop.sh scripts in the background and prints their exit statuses:

#!/bin/bash

handle_chld() {
    # Runs on SIGCHLD: find the children that are gone and collect their status.
    local i
    for i in "${!pids[@]}"; do
        if [ ! -d "/proc/${pids[i]}" ]; then
            wait "${pids[i]}"
            echo "Stopped ${pids[i]}; exit code: $?"
            unset 'pids[i]'
        fi
    done
}

set -o monitor
trap "handle_chld" CHLD

# Start background processes
./loop.sh 3 &
pids+=($!)
./loop.sh 2 &
pids+=($!)
./loop.sh -x &
pids+=($!)

# Wait until all background processes are stopped
while [ ${#pids[@]} -gt 0 ]; do echo "WAITING FOR: ${pids[@]}"; sleep 2; done
echo STOPPED

The handle_chld function handles the SIGCHLD signals. Setting the monitor option lets a non-interactive script receive SIGCHLD. Then the trap is set for the SIGCHLD signal.

Then the background processes are started, and all of their PIDs are stored in the pids array. When a SIGCHLD arrives, the handler looks through the /proc/ directories to find which child has stopped (the one whose directory is missing); the same check could be done with the kill -0 <PID> bash built-in. After wait, the exit status of the background process is available in the well-known $? special parameter.
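
The kill -0 alternative mentioned above can be sketched as follows. This is only an illustration under the assumption that the PID belongs to a child of the current shell; is_alive is a hypothetical helper name, and sleep 30 stands in for the real process.

```shell
#!/bin/bash
# is_alive is a hypothetical helper: kill -0 sends no signal, it only checks
# whether the PID exists and can be signalled by this user.
is_alive() {
    kill -0 "$1" 2>/dev/null
}

sleep 30 &                 # stand-in for the real background process
pid=$!
if is_alive "$pid"; then before=alive; else before=gone; fi

kill "$pid"
wait "$pid" 2>/dev/null    # reap the child so its PID really disappears
if is_alive "$pid"; then after=alive; else after=gone; fi

echo "before: $before, after: $after"
```

Unlike the /proc check, this does not depend on a Linux-style /proc filesystem, but kill -0 also fails for processes owned by another user, so it is best kept for the script's own children.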

The main script waits for all PIDs to stop (otherwise it could not get the exit status of its children) and then it stops itself.

Example output:

WAITING FOR: 13102 13103 13104
SLEEPING [13103]: 1/2
SLEEPING [13102]: 1/3
Stopped 13104; exit code: 1
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13103]: 2/2
SLEEPING [13102]: 2/3
WAITING FOR: 13102 13103
WAITING FOR: 13102 13103
SLEEPING [13102]: 3/3
Stopped 13103; exit code: 0
WAITING FOR: 13102
WAITING FOR: 13102
WAITING FOR: 13102
Stopped 13102; exit code: 0
STOPPED

As the output shows, the exit codes are reported correctly.

I hope this helps a bit!
