Tracing memory corruption on a production Linux server


Problem description


Guys, could you please recommend a tool for spotting memory corruption on a production multithreaded server built with C++ and running under Linux x86_64? I'm currently facing the following problem: every few hours my server crashes with a segfault, and the core dump shows that the error happens in malloc/calloc, which is almost certainly a sign of memory being corrupted somewhere.

Actually, I have already tried some tools without much luck. Here is my experience so far:

  • Valgrind is a great (I'd even say the best) tool, but it slows the server down too much, making it unusable in production. I tried it on a stage server and it really helped me find some memory-related issues, but even after fixing them I still get crashes on the production server. I ran my stage server under Valgrind for several hours but still couldn't spot any serious errors.

  • ElectricFence is said to be a real memory hog, but I couldn't even get it working properly. It segfaults almost immediately on the stage server in random weird places where Valgrind didn't show any issues at all. Maybe ElectricFence doesn't support threading well? I have no idea.

  • DUMA - same story as ElectricFence, but even worse. While EF produced core dumps with readable backtraces, DUMA shows me only "?????" (and yes, the server is built with the -g flag for sure).

  • dmalloc - I configured the server to use it instead of the standard malloc routines, but it hangs after several minutes. Attaching gdb to the process reveals it's hung somewhere in dmalloc :(
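
For context, ElectricFence and DUMA catch overruns by placing each allocation right before an inaccessible guard page, which is also why they use so much memory. Below is a minimal sketch of that idea, not the code of either tool. The names `GuardedBlock`, `guarded_alloc` and `guarded_free` are hypothetical, and the sketch assumes a POSIX system with `mmap` and `mprotect`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper types, not part of ElectricFence or DUMA.
struct GuardedBlock {
    void*       user;     // pointer handed to the caller
    void*       mapping;  // start of the whole mmap'ed region
    std::size_t map_len;  // total mapping length, guard page included
};

// Returns `size` bytes whose last byte sits directly before a PROT_NONE page,
// so writing one byte past the end faults immediately instead of silently
// corrupting a neighbouring allocation.
inline GuardedBlock guarded_alloc(std::size_t size) {
    assert(size > 0);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t data_pages = (size + page - 1) / page;
    const std::size_t map_len = (data_pages + 1) * page;  // +1 guard page

    void* base = mmap(nullptr, map_len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) return {nullptr, nullptr, 0};

    char* guard = static_cast<char*>(base) + data_pages * page;
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, map_len);
        return {nullptr, nullptr, 0};
    }
    // Note: the returned pointer is only byte-aligned in this sketch.
    return {guard - size, base, map_len};
}

inline void guarded_free(const GuardedBlock& b) {
    if (b.mapping != nullptr) munmap(b.mapping, b.map_len);
}
```

Even a 100-byte request costs at least two pages here, which gives a feel for how quickly a server making many small allocations can exhaust memory under this scheme.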

I'm gradually going crazy and simply don't know what to do next. I still have the following tools to try: mtrace and mpatrol, but maybe someone has a better idea?

I'd greatly appreciate any help on this issue.

Update: I managed to find the source of the bug. However, I found it on the stage server, not the production one, using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without any significant slowdown...

Solution

Folks, I managed to find the source of the bug. However, I found it on the stage server using helgrind/DRD/tsan: there was a data race between several threads which resulted in memory corruption. The key was to use proper Valgrind suppressions, since these tools showed too many false positives. Still, I don't really know how this could be discovered on the production server without any significant slowdown...
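
The post does not show the racy code itself, so the following is only a hypothetical illustration of the kind of bug described. Several threads appending to a shared `std::vector` without a lock can corrupt the heap and crash later inside malloc, far from the real culprit. The sketch shows the corrected, locked version; `EventLog` and `run_writers` are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared structure; the original post does not show its code.
class EventLog {
public:
    void add(int value) {
        // Without this lock, concurrent push_back calls race on the vector's
        // internal buffer: two threads may reallocate at the same time and
        // free or overwrite each other's storage, corrupting the heap.
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(value);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return events_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<int>   events_;
};

// Starts `threads` writers that each append `per_thread` values, waits for
// all of them, and returns the final number of stored events.
inline std::size_t run_writers(EventLog& log, int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&log, per_thread] {
            for (int i = 0; i < per_thread; ++i) log.add(i);
        });
    }
    for (auto& th : pool) th.join();
    return log.size();
}
```

If the `lock_guard` in `add` is removed, building with `-g -fsanitize=thread` (GCC or Clang) or running under `valgrind --tool=helgrind` should flag the race in `push_back`. Reports from third-party code that are known to be harmless can be filtered with a Valgrind suppressions file passed via `--suppressions=`.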
