Myers diff算法与Hunt-McIlroy算法 [英] Myers diff algorithm vs Hunt–McIlroy algorithm

查看:156
本文介绍了Myers diff算法与Hunt-McIlroy算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

最长的常见子序列问题是经典的计算机科学问题,解决该问题的算法版本控制系统和Wiki引擎的根目录。两种基本算法是 Hunt–McIlroy算法,用于创建原始算法版本 diff Myers diff算法,目前由 GNU diff实用程序使用。通过在代表两个字符串或文本文件之间的编辑空间的图形中找到最短路径,似乎两者都或多或少起作用。编辑空间是将一个序列转换为另一个序列所需的插入或删除次数。那么Myer的diff算法和Hunt-McIlroy算法到底有什么区别?

The longest common subsequence problem is a classic computer science problem, algorithms to solve it are the root of version control systems and wiki engines. Two basic algorithms are the Hunt–McIlroy algorithm which was used to create the original version of diff, and the Myers diff algorithm which is used by the GNU diff utility currently. Both seem to work more or less by finding the shortest path through a graph that represents the edit space between the two strings or text files. The edit space is the number of inserts or deletes necessary to transform one sequence into the other. So what exactly is the difference between Myer's diff algorithm and the Hunt–McIlroy algorithm?

推荐答案


  • Myers算法是一种分而治之算法:它通过递归查找两个具有最小编辑脚本的序列的中心匹配来工作。完成此操作后,只会记住匹配项,然后再次递归比较之前和之后的两个子序列,直到没有其他可比较的为止。通过尽可能地匹配子序列的末端来找到中心匹配项,并且在不可能的情况下,将编辑脚本增加1,扫描每个对角线到达的最远位置,然后再看多远匹配可以进行,如果匹配遇到另一端的匹配,则算法只是找到了中心匹配。这种方法的优点是可以占用O(m + n)内存,并且可以在 O(nd)中执行,这要取决于编辑脚本的复杂性。

    • The Myers algorithm is a "divide and conquer algorithm": it works by finding recursively the central match of two sequences with the smallest edit script. Once this is done only the match is memorized, and the two subsequences preceding and following it are compared again recursively until there is nothing more to compare. Finding the central match is done by matching the ends of sub-sequences as far as possible, and any time it is not possible, augment the edit script by 1, scanning each furthest position attained up to there for each diagonal and see how far again matches can go, if a match encounter a match of the other end, the algorithm just found the central match. This approach has the advantage to occupy O(m+n) memory, and executes in O(nd), d behing the edit script complexity.

      Hunt和McIlroy 方法计算整个文件中的匹配项,并将它们索引为所谓的k个候选对象。 k是LCS长度。通过找到属于适当纵坐标内的匹配项(遵循其论文中解释的规则),逐步增强LCS。在执行此操作时,将记住每条路径。这种方法的问题在于它做的工作比必要的多:它存储所有路径,在最坏的情况下存储 O(mn)内存,并存储 o(mn log m)

      The Hunt and McIlroy approach compute matches in the whole file, and indexes them into so called k-candidates. k being the LCS length. The LCS is augmented progressively by finding matches that fall within proper ordinates (following a rule explained in their paper). While doing this each path is memorized. The problem with the approach is that it does more job than necessary: it memorizes all the paths, O(mn) memory in the worst case, and o(mn log m) for the time complexity.

      Myers算法之所以能胜出,是因为它在工作时不记住路径,并且确实无需预见前进的方向,这样做可以随时将注意力集中在以最小复杂度的编辑脚本可以到达的最远位置。这种最小的复杂度可确保找到的是LCS,并且不会记住找到匹配项所经过的路径,直到找到匹配项时才使用较少的内存。不尝试提前计算所有匹配项将避免它们的存储,并使它们在匹配时受益,而不是浪费内存。

      The Myers algorithm mostly wins because it does not memorize the paths while working, and does not need to "foresee" where to go, doing so it can concentrate at any time only on the furthest positions it could reach with an edit script of smallest complexity. This "smallest complexity" ensures that what is found is the LCS, and not memorizing through which path it went until it found a match uses less memory. Not trying to compute all matches ahead of time avoids their storage, and make them a benefit at match time rather than a memory hog.

      这篇关于Myers diff算法与Hunt-McIlroy算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆