How do you fix a bug you can't replicate?


Question


The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Can you even fix it?

I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?


Edit: I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?

Solution

These are known as Heisenbugs.

Language

Different programming languages will have their own flavour of bugs.

C

Adding debug statements can make the problem impossible to duplicate (see http://c2.com/cgi/wiki?HeisenBug) because the debug statement itself shifts pointers (far enough to avoid a SEGFAULT). Pointer issues are a nightmare to track and replicate, but there are debuggers (such as GDB and DDD) that can help.

Java

An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
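
Once a particular interleaving is suspected, it can sometimes be forced so that the failure happens on every run instead of once in a million. The following is a minimal sketch in Python rather than Java; `Counter`, `unsafe_increment`, and `force_lost_update` are hypothetical names invented for illustration. It uses `threading.Event` to pin down the order in which two threads read and write a shared value.

```python
import threading

class Counter:
    """Shared counter with a deliberately unsafe read-modify-write."""
    def __init__(self):
        self.value = 0

def unsafe_increment(counter, after_read=None, before_write=None):
    current = counter.value               # read
    if after_read is not None:
        after_read.set()                  # signal: the old value has been read
    if before_write is not None:
        before_write.wait(timeout=5)      # hold until the other thread has written
    counter.value = current + 1           # write, possibly overwriting a newer value

def force_lost_update():
    """Run two increments in the one ordering that loses an update."""
    counter = Counter()
    a_has_read = threading.Event()
    b_has_written = threading.Event()

    def thread_a():
        unsafe_increment(counter, after_read=a_has_read, before_write=b_has_written)

    def thread_b():
        a_has_read.wait(timeout=5)        # start only after A holds the stale value
        unsafe_increment(counter)
        b_has_written.set()

    threads = [threading.Thread(target=thread_a), threading.Thread(target=thread_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value                  # two increments ran, but one was lost
```

With the ordering forced, the counter ends at 1 rather than 2 every time, which gives a repeatable test case to fix against (here, by guarding the read-modify-write with a lock).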

JavaScript

Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.

Environment

Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:

  • on a server?
  • on a desktop?
  • in a web browser?

In what environment does the application produce the problem?

  • development?
  • test?
  • production?

Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.

Networking

As networking is essential to so many applications:

  • Ensure stable network connections, including wireless signals.
  • Does the software reconnect robustly after network failures?
  • Do all connections get closed properly so as to release memory?
  • Are people using the machine who shouldn't be?
  • Are rogue devices interacting with the machine's network?
  • Are there factories or radio towers nearby that can cause interference?
  • Do packet sizes and frequency fall within nominal ranges?
  • Are all network devices adequate for heavy bandwidth usage?
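
As one way to approach the reconnection question above, here is a hedged sketch of a retry wrapper with exponential backoff. The function name `call_with_retry` and its parameters are assumptions made for illustration, not part of any particular library; each failure is logged so that intermittent network trouble leaves a trace.

```python
import logging
import time

log = logging.getLogger("network")

def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ConnectionError, log it, wait, and retry.

    The wait doubles after each failure (base_delay, 2*base_delay, ...).
    The last failure is re-raised so the caller still sees it.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as exc:
            log.warning("attempt %d/%d failed: %s", attempt + 1, attempts, exc)
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter is a design choice that lets a test substitute a recorder for the real delay, so the retry behaviour itself can be checked quickly.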

Consistency

Eliminate as many unknowns as possible:

  • Isolate architectural components.
  • Remove non-essential, or possibly problematic (conflicting), elements.
  • Deactivate different application modules.

Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to set up the computers. Consistency is key.
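
Differences between environments are easier to remove once they are listed. The sketch below is one possible way to do that: `snapshot` records a few facts about the machine it runs on, and `diff_environments` reports every setting that differs between two snapshots. Both function names and the chosen keys are assumptions; a real snapshot would likely also record installed package versions and configuration values.

```python
import platform
import sys

def snapshot():
    """Collect a few basic facts about the current environment."""
    return {
        "python": platform.python_version(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "encoding": sys.getdefaultencoding(),
    }

def diff_environments(reference, candidate):
    """Return {key: (reference_value, candidate_value)} for every difference.

    A key missing on one side is reported with None for that side.
    """
    keys = set(reference) | set(candidate)
    return {
        key: (reference.get(key), candidate.get(key))
        for key in sorted(keys)
        if reference.get(key) != candidate.get(key)
    }
```

Running `snapshot()` on production and on a development machine, then comparing the two dictionaries, gives a concrete list of candidates to eliminate.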

Logging

Use liberal amounts of logging to correlate the times at which events happened. Examine logs for any obvious errors, timing issues, etc.
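
To make log lines easy to correlate, it helps if each one carries a precise timestamp, the thread name, and an identifier for the request or session it belongs to. A minimal sketch using Python's standard `logging` module follows; the helper names `make_logger` and `for_request` and the `request_id` field are assumptions chosen for illustration.

```python
import logging

def make_logger(name="app"):
    """Create a logger whose lines carry a timestamp, thread name, and request id."""
    logger = logging.getLogger(name)
    if not logger.handlers:               # avoid adding duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s.%(msecs)03d %(levelname)s %(threadName)s "
            "[req=%(request_id)s] %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%S",
        ))
        logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger

def for_request(logger, request_id):
    """Wrap the logger so every record is tagged with the given request id."""
    return logging.LoggerAdapter(logger, {"request_id": request_id})
```

Because the format string refers to `request_id`, this logger should always be used through `for_request`; logging on it directly would leave the field unset.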

Hardware

If the software seems okay, consider hardware faults:

  • Are the physical network connections solid?
  • Are there any loose cables?
  • Are chips seated properly?
  • Do all cables have clean connections?
  • Is the working environment clean and free of dust?
  • Have any hidden devices or cables been damaged by rodents or insects?
  • Are there bad blocks on drives?
  • Are the CPU fans working?
  • Can the motherboard power all components? (CPU, network card, video card, drives, etc.)
  • Could electromagnetic interference be the culprit?

And mostly for embedded:

  • Insufficient supply bypassing?
  • Board contamination?
  • Bad solder joints / bad reflow?
  • CPU not reset when supply voltages are out of tolerance?
  • Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?
  • Latch-up?
  • Floating input pins?
  • Insufficient (sometimes negative) noise margins on logic levels?
  • Insufficient (sometimes negative) timing margins?
  • Tin whiskers?
  • ESD damage?
  • ESD upsets?
  • Chip errata?
  • Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?
  • Race conditions?
  • Counterfeit components?

Network vs. Local

What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?

Firmware

In between hardware and software is firmware.

  • Is the computer BIOS up-to-date?
  • Is the BIOS battery working?
  • Are the BIOS clock and system clock synchronized?

Time and Statistics

Timing issues are difficult to track:

  • When does the problem happen?
  • How frequently?
  • What other systems are running at that time?
  • Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?

Gather hard numerical data on the problem. A problem that at first appears random might actually have a pattern.
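
Even a very simple tally can reveal such a pattern. As an illustration, the sketch below counts bug reports by hour of day; the function name `reports_by_hour` is hypothetical, and it assumes the report times are available as ISO 8601 strings.

```python
from collections import Counter
from datetime import datetime

def reports_by_hour(timestamps):
    """Count bug reports per hour of day from ISO 8601 timestamp strings."""
    return Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
```

A cluster around one hour would suggest looking at whatever else runs at that time, such as a backup or a scheduled job. The same tally can be repeated by weekday, server, or client version.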

Change Management

Sometimes problems appear after a system upgrade.

  • When did the problem first start?
  • What changed in the environment (hardware and software)?
  • What happens after rolling back to a previous version?
  • What differences exist between the problematic version and good version?
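
When a known good version and a known bad version exist, the change that introduced the problem can be located by binary search over the versions in between (the same idea as `git bisect`). The following is a sketch under stated assumptions: `first_bad_version` is a hypothetical helper, the list is ordered oldest to newest, its first entry is good and its last is bad, and `is_bad` is a caller-supplied check.

```python
def first_bad_version(versions, is_bad):
    """Return the earliest version for which is_bad(version) is true.

    Assumes versions is ordered, versions[0] is good, versions[-1] is bad,
    and every version after the first bad one is also bad.
    """
    low, high = 0, len(versions) - 1      # invariant: low is good, high is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_bad(versions[mid]):
            high = mid
        else:
            low = mid
    return versions[high]
```

For an intermittent bug, `is_bad` would need to exercise each version many times before declaring it good, since a single clean run proves little.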

Library Management

Different operating systems have different ways of distributing conflicting libraries:

  • Windows has DLL Hell.
  • Unix can have numerous broken symbolic links.
  • Java library files can be equally nightmarish to resolve.

Perform a fresh install of the operating system, and include only the supporting software required for your application.

Java

Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.

Use a library management tool such as Maven or Ivy.
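
A quick check for duplicated libraries can also be scripted. The sketch below assumes JAR files follow the common `name-version.jar` naming convention, which is not guaranteed; `conflicting_jars` is a hypothetical helper that takes a list of file paths and reports any library present in more than one version.

```python
import re
from collections import defaultdict

# Assumed naming convention: <name>-<version>.jar, with the version starting with a digit.
JAR_NAME = re.compile(r"^(?P<name>.+?)-(?P<version>\d[\w.\-]*)\.jar$")

def conflicting_jars(paths):
    """Return {library: sorted versions} for libraries found in several versions."""
    versions = defaultdict(set)
    for path in paths:
        filename = path.replace("\\", "/").rsplit("/", 1)[-1]
        match = JAR_NAME.match(filename)
        if match:
            versions[match["name"]].add(match["version"])
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}
```

Feeding it the JAR paths from both the application and the application container would show where the two disagree.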

Debugging

Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
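
Random test data is most useful when a failure can be replayed. One way to get that, sketched below, is to drive the generator from a recorded seed. The names `find_failing_input`, `check`, and `make_input` are hypothetical: `check` is any caller-supplied function that raises an exception when the bug is detected.

```python
import random

def find_failing_input(check, make_input, seed, max_runs=10_000):
    """Feed seeded random inputs to check() until one raises.

    Returns (run_index, input) for the first failure, or None if every
    run passed. Re-running with the same seed replays the same inputs.
    """
    rng = random.Random(seed)             # private generator, reproducible from the seed
    for run in range(max_runs):
        data = make_input(rng)
        try:
            check(data)
        except Exception:
            return run, data
    return None
```

Logging the seed alongside any failure means the exact input sequence can be regenerated later, which turns a one-off sighting into a repeatable case.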

Sleep

It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.

Code Review

Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.

Cosmic Radiation

Cosmic rays can flip bits. This is not as big a problem as it was in the past, thanks to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated due to the randomness of cosmic radiation.

Tools

Bugs in the tools themselves are infrequent, but they do happen, especially with niche tools (e.g., a microcontroller C compiler that suffered from a symbol table overflow).
