High availability computing: How to deal with a non-returning system call, without risking false positives?


Question

I have a process that's running on a Linux computer as part of a high-availability system. The process has a main thread that receives requests from the other computers on the network and responds to them. There is also a heartbeat thread that periodically sends out multicast heartbeat packets to let the other processes on the network know that this process is still alive and available. If they don't hear any heartbeat packets from it for a while, one of them will assume this process has died and will take over its duties, so that the system as a whole can continue to work.

This all works pretty well, but the other day the entire system failed, and when I investigated why, I found the following:

  1. Due to (what is apparently) a bug in the box's Linux kernel, there was a kernel "oops" induced by a system call that this process's main thread made.
  2. Because of the kernel "oops", the system call never returned, leaving the process's main thread permanently hung.
  3. The heartbeat thread, OTOH, continued to operate correctly, which meant that the other nodes on the network never realized that this node had failed, and none of them stepped in to take over its duties... and so the requested tasks were not performed and the system's operation effectively halted.

My question is, is there an elegant solution that can handle this sort of failure? (Obviously one thing to do is fix the Linux kernel so it doesn't "oops", but given the complexity of the Linux kernel, it would be nice if my software could handle other future kernel bugs more gracefully as well.)

One solution I don't like would be to put the heartbeat generator into the main thread, rather than running it as a separate thread, or in some other way tie it to the main thread so that if the main thread gets hung up indefinitely, heartbeats won't get sent. The reason I don't like this solution is that the main thread is not a real-time thread, so doing this would introduce the possibility of occasional false positives where a slow-to-complete operation is mistaken for a node failure. I'd like to avoid false positives if I can.

Ideally there would be some way to ensure that a failed syscall either returns an error code, or if that's not possible, crashes my process; either of those would halt the generation of heartbeat packets and allow a failover to proceed. Is there any way to do that, or does an unreliable kernel doom my user process to unreliability as well?

Solution

My second suggestion is to use ptrace to find the current instruction pointer. You can have a parent thread that ptraces your process and interrupts it every second to check the current RIP value. This is somewhat complex, so I've written a demonstration program: (x86_64 only, but that should be fixable by changing the register names.)

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <linux/ptrace.h>
#include <sys/user.h>
#include <time.h>

// this number is arbitrary - find a better one.
#define STACK_SIZE (1024 * 1024)

int main_thread(void *ptr) {
    // "main" thread is now running under the monitor
    printf("Hello from main!\n");
    while (1) {
        int c = getchar();
        if (c == EOF) { break; }
        nanosleep(&(struct timespec) {0, 200 * 1000 * 1000}, NULL);
        putchar(c);
    }
    return 0;
}

int main(int argc, char *argv[]) {
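    char *vstack = malloc(STACK_SIZE);
    if (vstack == NULL) {
        perror("failed to allocate child stack");
        return 3;
    }
    pid_t v;
    // pass the top of the allocation: the stack grows downward on x86_64
    if (clone(main_thread, vstack + STACK_SIZE, CLONE_PARENT_SETTID | CLONE_FILES | CLONE_FS | CLONE_IO, NULL, &v) == -1) { // you'll want to check these flags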
        perror("failed to spawn child task");
        return 3;
    }
    printf("Target: %d; %d\n", v, getpid());
    long ptv = ptrace(PTRACE_SEIZE, v, NULL, NULL);
    if (ptv == -1) {
        perror("failed monitor seize");
        exit(1);
    }
    struct user_regs_struct regs;
    fprintf(stderr, "beginning monitor...\n");
    while (1) {
        sleep(1);
        ptv = ptrace(PTRACE_INTERRUPT, v, NULL, NULL); // reuse the outer ptv instead of shadowing it
        if (ptv == -1) {
            perror("failed to interrupt main thread");
            break;
        }
        int status;
        if (waitpid(v, &status, __WCLONE) == -1) {
            perror("target wait failed");
            break;
        }
        if (!WIFSTOPPED(status)) { // this section is messy. do it better.
            fputs("target wait went wrong\n", stderr);
            break;
        }
        // a stop caused by PTRACE_INTERRUPT normally shows up as PTRACE_EVENT_STOP with SIGTRAP
        if ((status >> 8) != (SIGTRAP | (PTRACE_EVENT_STOP << 8))) {
            fputs("target wait went wrong (2)\n", stderr);
            break;
        }
        ptv = ptrace(PTRACE_GETREGS, v, NULL, &regs);
        if (ptv == -1) {
            perror("failed to peek at registers of thread");
            break;
        }
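        // time_t and the 64-bit registers need matching format specifiers
        fprintf(stderr, "%ld -> RIP %llx RSP %llx\n", (long)time(NULL), (unsigned long long)regs.rip, (unsigned long long)regs.rsp);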
        ptv = ptrace(PTRACE_CONT, v, NULL, NULL);
        if (ptv == -1) {
            perror("failed to resume main thread");
            break;
        }
    }
    return 2;
}

Note that this is not production-quality code. You'll need to fix up a number of things.

Based on this, you should be able to figure out whether or not the program counter is advancing, and you could combine this with other pieces of information (such as /proc/PID/status) to find out whether it's busy in a system call. You might also be able to extend the usage of ptrace to check which system call is in progress, so that you can decide whether it's a reasonable one to be waiting on.
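As a rough illustration of that idea, here is a minimal sketch of two helpers the monitor loop might call. The names stall_tracker, stall_update and read_task_state are hypothetical and not part of the demo above. The first counts how many consecutive samples showed the same RIP and RSP; the second reads the state letter from the "State:" line of /proc/PID/status. A long run of identical samples is only a hint, since a thread legitimately blocked in a system call such as read() looks the same, so the threshold and the final decision are left to the caller.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: remembers the previous register sample. */
struct stall_tracker {
    unsigned long long last_rip;
    unsigned long long last_rsp;
    unsigned int same_count; /* consecutive samples equal to the one before */
    int have_sample;         /* 0 until the first sample is recorded */
};

/* Record one sample and return how many times in a row it has repeated. */
unsigned int stall_update(struct stall_tracker *t,
                          unsigned long long rip,
                          unsigned long long rsp) {
    if (t->have_sample && t->last_rip == rip && t->last_rsp == rsp) {
        t->same_count++;
    } else {
        t->same_count = 0;
    }
    t->last_rip = rip;
    t->last_rsp = rsp;
    t->have_sample = 1;
    return t->same_count;
}

/* Return the state letter from the "State:" line of /proc/<pid>/status
 * (for example 'R', 'S' or 'D'), or 0 if it could not be read. */
char read_task_state(pid_t pid) {
    char path[64];
    char line[256];
    char state = 0;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    f = fopen(path, "r");
    if (f == NULL) {
        return 0;
    }
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "State:", 6) == 0) {
            sscanf(line + 6, " %c", &state);
            break;
        }
    }
    fclose(f);
    return state;
}
```

In the demo loop, one would probably call read_task_state(v) before PTRACE_INTERRUPT (while the target is ptrace-stopped, its state should read as a tracing stop rather than what it was doing) and call stall_update(&tracker, regs.rip, regs.rsp) after PTRACE_GETREGS.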

This is a hacky solution, but I don't think that you'll find a non-hacky solution for this problem. Despite the hackiness, I don't think (this is untested) that it would be particularly slow; my implementation pauses the monitored thread once per second for a very short amount of time, which I would guess is in the range of hundreds of microseconds. A 100-microsecond pause once per second is around a 0.01% efficiency loss, theoretically.
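Since that figure is a guess, it may be worth measuring. The sketch below shows the arithmetic only; elapsed_us and overhead_percent are hypothetical helper names, and the timestamps are assumed to come from clock_gettime(CLOCK_MONOTONIC, ...) taken in the monitor just before PTRACE_INTERRUPT and just after PTRACE_CONT, which only approximates the time the target actually spends stopped.

```c
#include <time.h>

/* Microseconds elapsed between two CLOCK_MONOTONIC readings. */
double elapsed_us(const struct timespec *start, const struct timespec *end) {
    return (end->tv_sec - start->tv_sec) * 1e6
         + (end->tv_nsec - start->tv_nsec) / 1e3;
}

/* Percentage of each monitoring period that the target spends stopped. */
double overhead_percent(double pause_us, double period_us) {
    return 100.0 * pause_us / period_us;
}
```

With a 100-microsecond pause and a one-second period (1,000,000 microseconds), overhead_percent gives roughly 0.01, matching the estimate above; a 500-microsecond pause would give roughly 0.05.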
