How do you debug a bug that only appears when the load is huge?
We are currently developing cluster manager software in C. If several nodes connect to the manager, it works perfectly, but if we use tools to simulate 1000 nodes connecting to the manager, it sometimes behaves in unexpected ways.
How can one debug this kind of bug, which only appears when the load (connections/nodes) is large?
If I use gdb to debug step by step, the app never malfunctions.
> How to debug this kind of bug?
In general, you want to use at least these techniques:
- Make sure the code compiles and links without warnings. -Wall is a good start, but -Wextra is better.
- Make sure the application has designed-in logging and tracing that can be turned on or off, has enough detail to debug these kinds of issues, and has low overhead.
- Make sure the code has good unit-test coverage.
- Make sure the tests are sanitizer-clean.
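As a rough illustration of the logging and tracing point above, here is a minimal sketch of tracing that can be switched on and off at run time and costs very little when off. The category names, trace_set_mask, trace_enabled, trace_emit and the TRACE macro are all hypothetical names chosen for this example. They are not from the question, and a real cluster manager would likely want timestamps and a lock-free or per-thread buffer as well.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical trace categories for a cluster manager; the names are
 * illustrative and not taken from the original question. */
enum trace_cat {
    TRACE_CONN  = 1u << 0,  /* connection accept/close */
    TRACE_PROTO = 1u << 1,  /* protocol message handling */
    TRACE_TIMER = 1u << 2   /* timeouts and heartbeats */
};

/* Bitmask of enabled categories; 0 means tracing is off. */
static atomic_uint trace_mask;

/* Enable exactly the categories in mask (pass 0 to turn tracing off). */
void trace_set_mask(unsigned mask)
{
    atomic_store_explicit(&trace_mask, mask, memory_order_relaxed);
}

/* Cheap check: one relaxed atomic load and a bit test. */
int trace_enabled(unsigned cat)
{
    return (atomic_load_explicit(&trace_mask, memory_order_relaxed) & cat) != 0;
}

/* Formats one trace line and writes it to stderr. */
void trace_emit(const char *file, int line, const char *fmt, ...)
{
    char buf[512];
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof buf, fmt, ap);
    va_end(ap);
    fprintf(stderr, "%s:%d: %s\n", file, line, buf);
}

/* The format arguments are evaluated only when the category is enabled,
 * so a disabled trace point does no formatting work. */
#define TRACE(cat, ...)                                   \
    do {                                                  \
        if (trace_enabled(cat))                           \
            trace_emit(__FILE__, __LINE__, __VA_ARGS__);  \
    } while (0)
```

A call site would look like TRACE(TRACE_CONN, "node %d connected", id). Because the check is a single atomic load, such trace points can stay in the production build and be enabled only while the 1000-node simulation is running, which disturbs timing far less than single-stepping in gdb.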
> there's also no warning in valgrind check.
It's not clear whether you've simply run the target application under Valgrind, or whether you also have unit tests and those tests are Valgrind-clean. It's also not clear whether you've observed the application misbehaving under Valgrind or not.
Valgrind used to be the best tool available for heap and uninitialized memory problems, but in 2017 this is no longer the case.
Compiler-based Address, Thread and Memory sanitizers catch a significantly wider class of errors (e.g. global and stack overflows, and data races), and you should run your unit tests under all of them.
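To make the sanitizer advice concrete, here is a small sketch of a multi-threaded stress test of the kind that is worth running under ThreadSanitizer. The struct registry type and the functions registry_add_node and registry_stress_test are hypothetical stand-ins for whatever shared state the manager keeps per node; nothing here is taken from the actual code in the question. With GCC or Clang the test can be built with -fsanitize=thread (or -fsanitize=address for memory errors), plus -g and -pthread.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 8
#define REGISTRATIONS_PER_WORKER 100000

/* Hypothetical shared state: how many nodes have registered. */
struct registry {
    pthread_mutex_t lock;
    long node_count;
};

/* Correct version: the counter is updated under the mutex. If the
 * lock/unlock pair is removed, the increment becomes a data race that
 * ThreadSanitizer reports and that loses updates only under load. */
void registry_add_node(struct registry *r)
{
    pthread_mutex_lock(&r->lock);
    r->node_count++;
    pthread_mutex_unlock(&r->lock);
}

/* Each worker simulates many nodes registering back to back. */
static void *worker(void *arg)
{
    struct registry *r = arg;

    for (int i = 0; i < REGISTRATIONS_PER_WORKER; i++)
        registry_add_node(r);
    return NULL;
}

/* Runs all workers concurrently and returns the final node count,
 * or -1 if the mutex or a thread could not be created. */
long registry_stress_test(void)
{
    struct registry r;
    pthread_t threads[NUM_WORKERS];
    int started = 0;
    long result;

    r.node_count = 0;
    if (pthread_mutex_init(&r.lock, NULL) != 0)
        return -1;

    while (started < NUM_WORKERS &&
           pthread_create(&threads[started], NULL, worker, &r) == 0)
        started++;

    /* Join every thread that did start, even on partial failure. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    result = (started == NUM_WORKERS) ? r.node_count : -1;
    pthread_mutex_destroy(&r.lock);
    return result;
}
```

A unit test would check that registry_stress_test() returns NUM_WORKERS * REGISTRATIONS_PER_WORKER. Without the sanitizer, a missing lock may only show up occasionally as a wrong count; under ThreadSanitizer the race is typically reported as soon as two threads touch the counter unsynchronized, even if the final count happens to be right.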
When all of the above still fails to find the problem, you may be able to run the real application instrumented with sanitizers.
Lastly, there are tools like GDB tracing and SystemTap. They are harder to learn, but give you significant power.