Pointer passed to function changes unexpectedly

Problem description

I'm designing a preloader-based lock tracing utility that attaches to Pthreads, and I've run into a weird issue. The program works by providing wrappers that replace relevant Pthreads functions at runtime; these do some logging, and then pass the args to the real Pthreads function to do the work. They do not modify the arguments passed to them, obviously. However, when testing, I discovered that the condition variable pointer passed to my pthread_cond_wait() wrapper does not match the one that gets passed to the underlying Pthreads function, which promptly crashes with "futex facility returned an unexpected error code," which, from what I've gathered, usually indicates an invalid sync object passed in. Relevant stack trace from GDB:

#8  __pthread_cond_wait (cond=0x7f1b14000d12, mutex=0x55a2b961eec0) at pthread_cond_wait.c:638
#9  0x00007f1b1a47b6ae in pthread_cond_wait (cond=0x55a2b961f290, lk=0x55a2b961eec0)
    at pthread_trace.cpp:56

I'm pretty mystified. Here's the code for my pthread_cond_wait() wrapper:

int pthread_cond_wait(pthread_cond_t* cond, pthread_mutex_t* lk) {
        // log arrival at wait
        the_tracer.add_event(lktrace::event::COND_WAIT, (size_t) cond);
        // run pthreads function
        GET_REAL_FN(pthread_cond_wait, int, pthread_cond_t*, pthread_mutex_t*);
        int e = REAL_FN(cond, lk);
        if (e == 0) the_tracer.add_event(lktrace::event::COND_LEAVE, (size_t) cond);
        else {
                the_tracer.add_event(lktrace::event::COND_ERR, (size_t) cond);
        }
        return e;
}

// GET_REAL_FN is defined as:
#define GET_REAL_FN(name, rtn, params...) \
        typedef rtn (*real_fn_t)(params); \
        static const real_fn_t REAL_FN = (real_fn_t) dlsym(RTLD_NEXT, #name); \
        assert(REAL_FN != NULL) // semicolon absence intentional

And here's the code for __pthread_cond_wait in glibc 2.31 (this is the function that gets called if you call pthread_cond_wait normally; it has a different name because of symbol versioning. The stack trace above confirms that this is the function that REAL_FN ends up in):

int
__pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mutex)
{
  /* clockid is unused when abstime is NULL. */
  return __pthread_cond_wait_common (cond, mutex, 0, NULL);
}   

As you can see, neither of these functions modifies cond, yet it is not the same in the two frames. Examining the two different pointers in a core dump shows that they point to different contents, as well. I can also see in the core dump that cond does not appear to change in my wrapper function (i.e. it's still equal to 0x5... in frame 9 at the crash point, which is the call to REAL_FN). I can't really tell which pointer is correct by looking at their contents, but I'd assume it's the one passed in to my wrapper from the target application. Both pointers point to valid segments for program data (marked ALLOC, LOAD, HAS_CONTENTS).

My tool is definitely causing the error somehow; the target application runs fine if the tool is not attached. What am I missing?

UPDATE: Actually, this doesn't appear to be what's causing the error, because calls to my pthread_cond_wait() wrapper succeed many times before the error occurs, and exhibit similar behavior (pointer value changing between frames without explanation) each time. I'm leaving the question open, though, because I still don't understand what's going on here and I'd like to learn.

UPDATE 2: As requested, here's the code for tracer.add_event():

// add an event to the calling thread's history
// hist_entry ctor gets timestamp & stack trace
void tracer::add_event(event e, size_t obj_addr) {
        size_t tid = get_tid();
        hist_map::iterator hist = histories.contains(tid);
        assert(hist != histories.end());
        hist_entry ev (e, obj_addr);
        hist->second.push_back(ev);
}

// hist_entry ctor:
hist_entry::hist_entry(event e, size_t obj_addr) :
        ts(chrono::steady_clock::now()), ev(e), addr(obj_addr) {

        // these are set in the tracer ctor     
        assert(start_addr && end_addr);

        void* buf[TRACE_DEPTH];
        int v = backtrace(buf, TRACE_DEPTH);
        int a = 0;
        // find first frame outside of our own code
        while (a < v && start_addr < (size_t) buf[a] &&
                end_addr > (size_t) buf[a]) ++a;
        // skip requested amount of frames
        a += TRACE_SKIP;
        if (a >= v) a = v-1;
        caller = buf[a];
}

histories is a lock-free concurrent hashmap from libcds (mapping tid->per-thread vectors of hist_entry), and its iterators are guaranteed to be thread-safe as well. GNU docs say backtrace() is thread-safe, and there's no data races mentioned in the CPP docs for steady_clock::now(). get_tid() just calls pthread_self() using the same method as the wrapper functions, and casts its result to size_t.

Recommended answer

Hah, figured it out! The issue is that Glibc exposes multiple versions of pthread_cond_wait(), for backwards compatibility. The version I reproduce in my question is the current version, the one we want to call. The version that dlsym() was finding is the backwards-compatible version:

int
__pthread_cond_wait_2_0 (pthread_cond_2_0_t *cond, pthread_mutex_t *mutex)
{
  if (cond->cond == NULL)
    {
      pthread_cond_t *newcond;

      newcond = (pthread_cond_t *) calloc (sizeof (pthread_cond_t), 1);
      if (newcond == NULL)
        return ENOMEM;

      if (atomic_compare_and_exchange_bool_acq (&cond->cond, newcond, NULL))
        /* Somebody else just initialized the condvar.  */
        free (newcond);
    }

  return __pthread_cond_wait (cond->cond, mutex);
}

As you can see, this version tail-calls the current one, which is probably why this took so long to detect: GDB is normally pretty good at detecting frames elided by tail calls, but I'm guessing it didn't detect this one because the functions have the "same" name (and the error doesn't affect the mutex functions because they don't expose multiple versions). This blog post goes into much more detail, coincidentally specifically about pthread_cond_wait(). I stepped through this function many times while debugging and sort of tuned it out, because every call into glibc is wrapped in multiple layers of indirection; I only realized what was going on when I set a breakpoint on the pthread_cond_wait symbol, instead of a line number, and it stopped at this function.

Anyway, this explains the changing pointer phenomenon: what happens is that the old, incorrect function gets called, reinterprets the pthread_cond_t object as a struct containing a pointer to a pthread_cond_t object, allocates a new pthread_cond_t for that pointer, and then passes the newly allocated one to the new, correct function. The frame of the old function gets elided by the tail-call, and to a GDB backtrace after leaving the old function it looks like the correct function gets called directly from my wrapper, with a mysteriously changed argument.
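
To make the mechanism concrete, here is a rough model of it, not glibc's actual code. The names new_cond, current_wait, compat_wait and call_through_wrong_version are all made up for illustration, the 48-byte size is an arbitrary stand-in, and the sketch is single-threaded (the real compat function uses an atomic compare-and-exchange). It assumes the application's object starts out zeroed, so the old entry point sees a NULL pointer in the first word:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for the current pthread_cond_t: an opaque blob.
struct new_cond { alignas(void*) unsigned char state[48]; };

// Models the current implementation: just reports which object it received.
const void* current_wait(new_cond* c) { return c; }

// Models the old compat entry point: it treats the first word of the
// object as a pointer to the real condvar and allocates one on first use.
// (Not thread-safe; this only illustrates the pointer swap.)
const void* compat_wait(void* obj) {
    assert(obj != nullptr);
    new_cond* inner = nullptr;
    std::memcpy(&inner, obj, sizeof inner);      // read first word as a pointer
    if (inner == nullptr) {
        inner = static_cast<new_cond*>(std::calloc(1, sizeof(new_cond)));
        if (inner == nullptr) return nullptr;    // out of memory
        std::memcpy(obj, &inner, sizeof inner);  // overwrite the caller's object
    }
    return current_wait(inner);                  // forwards a DIFFERENT pointer
}

// What the wrapper effectively did: hand a new-style object to the
// old-style entry point.
const void* call_through_wrong_version(new_cond* c) {
    return compat_wait(c);
}
```

Passing a zero-initialized new_cond to call_through_wrong_version() returns a heap address rather than the address of the caller's object, which is the same mismatch the backtrace shows between frames 8 and 9. It also shows that the old entry point scribbles over the first word of the application's condition variable.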

The fix for this was simple: GNU provides the libdl extension dlvsym(), which is like dlsym() but also takes a version string. Looking up pthread_cond_wait with version string "GLIBC_2.3.2" solves the problem. Note that these version strings usually do not match the version of the installed glibc (e.g. pthread_create()/exit() have version string "GLIBC_2.2.5"), so they need to be looked up on a per-function basis. The correct string can be determined either by looking at the compat_symbol() or versioned_symbol() macros that are somewhere near the function definition in the glibc source, or by using readelf to see the names of the symbols in the compiled library (mine has "pthread_cond_wait@@GLIBC_2.3.2" and "pthread_cond_wait@GLIBC_2.2.5"; the "@@" marks the default version).
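
For reference, a minimal sketch of what the versioned lookup could look like, assuming glibc on x86-64 (where the current pthread_cond_wait carries the version "GLIBC_2.3.2"). The helper names resolve_next and real_cond_wait are invented for this example and are not part of the tool shown in the question. On glibc older than 2.34 this needs to be linked with -ldl:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // RTLD_NEXT and dlvsym() are GNU extensions
#endif
#include <cassert>
#include <dlfcn.h>
#include <pthread.h>

using cond_wait_fn = int (*)(pthread_cond_t*, pthread_mutex_t*);

// Hypothetical helper: find the next definition of `name` with the given
// symbol version. If that version does not exist (for example on an
// architecture that uses a different version string), fall back to plain
// dlsym().
void* resolve_next(const char* name, const char* version) {
    void* fn = dlvsym(RTLD_NEXT, name, version);
    if (fn == nullptr) fn = dlsym(RTLD_NEXT, name);
    return fn;
}

// Resolve the real pthread_cond_wait once and cache it.
// "GLIBC_2.3.2" is the version string from the answer above; check it
// with readelf on the target system before relying on it.
cond_wait_fn real_cond_wait() {
    static const cond_wait_fn fn = reinterpret_cast<cond_wait_fn>(
        resolve_next("pthread_cond_wait", "GLIBC_2.3.2"));
    assert(fn != nullptr);
    return fn;
}
```

The wrapper would then call real_cond_wait()(cond, lk) in place of REAL_FN(cond, lk). The dlsym() fallback is a judgment call: it keeps the tool working on platforms with a single symbol version, but on a platform with several versions and a wrong version string it would silently bring the original problem back, so failing hard there may be the better choice.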
