How do diff/patch work and how safe are they?


Question


Regarding how they work, I am wondering about the low-level details:

  1. What will trigger a merge conflict?
  2. Is the context also used by the tools in order to apply the patch?
  3. How do they deal with changes that do not actually modify source code behavior? For example, swapping function definition places.

Regarding safety, truth be told, the huge Linux kernel repository is a testament to their safety. But I am wondering about the following points:

  1. Are there any caveats/limitations regarding the tools that the user should be aware of?
  2. Have the algorithms been proven to not generate wrong results?
  3. If not, are there implementations/papers proposing integration testing that at least prove them to be error-free empirically? Something like the content of these papers BrianKorver and JamesCoplien.
  4. Again, the Linux repository should suffice regarding the previous point, but I was wondering about something more generic. Source code, even when changed, will not change much (especially because of the algorithm implemented and syntax restrictions), but can the safety be generalized to generic text files?

Edit

OK, I'm editing the question because it was vague and the answers were not addressing the details.

Git/diff/patch details

The unified diff format, which Git seems to use by default, basically outputs three things: the change, the context surrounding the change, and line numbers pertinent to the context. Each one of these things may or may not have been changed concurrently, so Git basically has to deal with 8 possible cases.

For example, if lines have been added or removed before the context, line numbers will be different; but if the context and the changes are still the same, then patch could use the context itself to align the texts and apply the patch (I do not know if this indeed happens). Now, what would happen in the other cases? I would like to know the details of how Git decides to apply changes automatically and when it decides to issue an error and let the user resolve the conflict.
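
To make the three parts of a unified diff concrete, here is a minimal sketch using Python's standard difflib module. This is an assumption for illustration only: Git has its own diff implementation, and the file names and contents below are made up.

```python
import difflib

# Hypothetical file contents: ten lines, with line 5 edited in the new version.
old = [f"line {i}" for i in range(1, 11)]
new = list(old)
new[4] = "line 5 (edited)"

# n=3 asks for three lines of context around each change.
diff = list(difflib.unified_diff(old, new, "a/file.txt", "b/file.txt", n=3, lineterm=""))
print("\n".join(diff))
```

In the printed output, the line starting with @@ carries the line numbers, lines starting with a space are context, and lines starting with - or + are the change itself.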

Reliability

I'm pretty sure Git is fully reliable, since it does have the full history of commits and can traverse history. What I would like is some pointers to academic research and references regarding this, if they exist.

Still kinda related to this subject, we know that Git/diff treat files as generic text files and work on lines. Furthermore, the LCS algorithm employed by diff will generate a patch trying to minimize the number of changes.

So here are some things I would like to know also:

  1. Why is LCS used instead of other string metric algorithms?
  2. If LCS is used, why not use modified versions of the metric that do take into account the grammatical aspects of the underlying language?
  3. If such a metric that takes into account grammatical aspects were used, could it provide benefits? Benefits in this case could be anything, for example, a cleaner "blame log".

Again, these could be huge topics and academic articles are welcome.

Answer

What will trigger a merge conflict?

Let's look at the simplest of git's merge strategies, recursive, first: When merging two branches, say a and b, that have a common ancestor c, git creates a patch to go from commit c to the commit at the head of a and tries to apply that patch to the tree at the head of b. If the patch fails, that's a merge conflict.

git by default uses the recursive strategy, a 3-way merge (since Git 2.34 the default is the newer ort strategy, which is also a 3-way merge). The general idea is the same: If the 3-way merge algorithm described in the link fails because two commits from different branches changed the same lines, that's a merge conflict.
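
The idea of a 3-way merge can be sketched in a few lines of Python. This is a toy, line-based version built on the standard difflib module, not Git's implementation; the names match_map and merge3 are made up for this example. It aligns both sides with the common ancestor, then walks the regions between lines that both sides kept: if only one side changed a region, that side wins; if both changed it differently, it is a conflict.

```python
from difflib import SequenceMatcher

def match_map(base, other):
    """Map each base line index to the index of the line difflib aligns it with."""
    mapping = {}
    matcher = SequenceMatcher(None, base, other, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping

def merge3(base, ours, theirs):
    """Return (merged_lines, conflict_count) for a toy line-based 3-way merge."""
    to_ours, to_theirs = match_map(base, ours), match_map(base, theirs)
    # Sync points: base lines that both sides kept, plus an end-of-file sentinel.
    sync = [(k, to_ours[k], to_theirs[k]) for k in range(len(base))
            if k in to_ours and k in to_theirs]
    sync.append((len(base), len(ours), len(theirs)))

    merged, conflicts = [], 0
    i = io = it = 0  # cursors into base, ours, theirs
    for k, ko, kt in sync:
        chunk_base, chunk_ours, chunk_theirs = base[i:k], ours[io:ko], theirs[it:kt]
        if chunk_ours == chunk_base:
            merged += chunk_theirs          # only "theirs" changed (or nobody did)
        elif chunk_theirs == chunk_base or chunk_ours == chunk_theirs:
            merged += chunk_ours            # only "ours" changed, or both agree
        else:
            conflicts += 1                  # both changed the same region differently
            merged += ["<<<<<<< ours", *chunk_ours, "=======", *chunk_theirs, ">>>>>>> theirs"]
        if k < len(base):
            merged.append(base[k])          # the line both sides kept
        i, io, it = k + 1, ko + 1, kt + 1
    return merged, conflicts

base = ["a", "b", "c", "d", "e"]
clean, n_clean = merge3(base, ["a", "B", "c", "d", "e"], ["a", "b", "c", "D", "e"])
clash, n_clash = merge3(["a", "b", "c"], ["a", "X", "c"], ["a", "Y", "c"])
print(clean, n_clean)
print(clash, n_clash)
```

The first call merges two edits to different lines without trouble; the second has both sides editing the same line, which is exactly the situation that produces conflict markers.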

Is the context also used by the tools in order to apply the patch?

Yes. If a hunk does not apply at the exact line number stored in the diff file, patch searches forward and backward from that position for a place where the hunk's context lines match, and applies the hunk there.
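
A minimal sketch of that search, in Python. The function apply_hunk and its parameters are hypothetical and only illustrate the principle; the real patch tool is more elaborate (GNU patch, for instance, also has a "fuzz factor" that lets it ignore some context lines when no exact match is found).

```python
def apply_hunk(lines, start, old, new, max_offset=50):
    """Replace `old` with `new`, looking near index `start` if `old` is not exactly there.

    `old` is the context plus removed lines the hunk expects to find;
    `new` is the context plus added lines that should replace them.
    Returns (patched_lines, offset_from_recorded_position).
    """
    for distance in range(max_offset + 1):
        # Try the recorded position first, then positions further and further away.
        candidates = (start,) if distance == 0 else (start - distance, start + distance)
        for pos in candidates:
            if pos >= 0 and lines[pos:pos + len(old)] == old:
                return lines[:pos] + new + lines[pos + len(old):], pos - start
    raise ValueError("hunk failed: context not found near the recorded line")

# The hunk was recorded against a file where "a = 1" was the first line,
# but two import lines have since been added above it.
current = ["import os", "import sys", "a = 1", "b = 2", "c = 3"]
patched, offset = apply_hunk(current, 0, ["a = 1", "b = 2", "c = 3"], ["a = 1", "b = 20", "c = 3"])
print(patched, offset)
```

If no position within max_offset matches, the sketch raises an error, which corresponds to patch rejecting the hunk.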

How do they deal with changes that do not actually modify source code behavior? For example, swapping function definition places.

patch is not intelligent; it cannot recognize such changes. It regards a moved function as a couple of deleted lines and a couple of added lines. If a commit on one branch alters a function and a commit on another branch moves that function unaltered, then an attempt to merge will always give you a merge conflict.
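
A small illustration in Python, using the standard difflib module as a stand-in for diff (an assumption; the two-function file is made up). Swapping the two definitions changes no behavior, yet the diff only reports removed and added lines, with no notion of "moved":

```python
import difflib

old = ["def f():", "    return 1", "", "def g():", "    return 2"]
new = ["def g():", "    return 2", "", "def f():", "    return 1"]

diff = list(difflib.unified_diff(old, new, "a/mod.py", "b/mod.py", lineterm=""))

# Collect the changed lines, skipping the "---" / "+++" file headers.
removed = [line[1:] for line in diff if line.startswith("-") and not line.startswith("---")]
added = [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]

print("\n".join(diff))
print("removed:", removed)
print("added:  ", added)
```

The same lines show up once as removed and once as added, which is why a concurrent edit to the moved function collides with the move.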

Are there any caveats/limitations regarding the tools that the user should be aware of?

As for patch and diff: No. Both use algorithms that have been around since the early 1970s and are quite robust. As long as they don't complain, you can be fairly certain that they did what you intended.

That being said: git merge tries to resolve merge conflicts on its own. In some rare cases, things can go wrong here - this page has an example close to its end.

Have the algorithms been proven to not generate wrong results? If not, are there implementations/papers proposing integration testing that at least prove them to be error-free empirically?

"wrong results" is a fairly unspecific term in this context; I'd claim it cannot be proven. What is empirically proven is that applying a patch generated by diff a b to file a will in any case produce file b.

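The kind of empirical check meant here can be sketched as a randomized round-trip test. This Python sketch uses difflib's own delta format (ndiff and restore) rather than the diff and patch binaries, so it is an analogy under that assumption, and it is evidence rather than a proof:

```python
import difflib
import random

def round_trip_ok(a, b):
    """Diff a against b, then rebuild both sides from the delta alone."""
    delta = list(difflib.ndiff(a, b))
    return (list(difflib.restore(delta, 1)) == a
            and list(difflib.restore(delta, 2)) == b)

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["alpha", "beta", "gamma", "delta"]
trials = [
    ([rng.choice(words) for _ in range(rng.randint(0, 8))],
     [rng.choice(words) for _ in range(rng.randint(0, 8))])
    for _ in range(200)
]
all_ok = all(round_trip_ok(a, b) for a, b in trials)
print("all round trips succeeded:", all_ok)
```
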
Source code, even when changed, will not change much (especially because of the algorithm implemented and syntax restrictions), but can the safety be generalized to generic text files?

Again, diff/patch/git does not differentiate between source code and other text files. git works as well on generic text files as it does on source code.

I'm pretty sure Git is fully reliable, since it does have the full history of commits and can traverse history. What I would like is some pointers to academic research and references regarding this, if they exist.

Commits in git are snapshots of the tree with metadata, not diffs to the adjacent versions. Patch and diff are not involved in revision traversal at all. (But one level below the surface, git then organizes blobs in pack files that do use a delta compression algorithm. Errors here would be easy to spot because git internally uses SHA-1 sums to identify files, and the sum would change if an error occurred.)
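
To show why a corrupted object would be noticed, here is a short Python sketch of how a blob's id is derived from its content in a SHA-1 repository: the hash covers a small header ("blob", the size, a NUL byte) followed by the file's bytes. The helper name git_blob_id is made up for this example.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object id Git assigns to a blob with this content (SHA-1 repositories)."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

good = git_blob_id(b"hello world\n")
corrupted = git_blob_id(b"hello w0rld\n")  # a single changed character
print(good)
print(corrupted)
print("ids differ:", good != corrupted)
```

Because the id is a function of the content, any change to the stored bytes yields a different id than the one the rest of the repository refers to.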

Why is LCS used instead of other string metric algorithms?

git uses Myers' algorithm by default. The original paper explains why it works the way it does. (It's not purely LCS.)
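
For a feel of what the paper describes, here is a minimal Python sketch of the greedy core of Myers' algorithm. It only computes the length of the shortest edit script (the number of inserted plus deleted elements), not the script itself, and it is a textbook-style sketch rather than git's actual code, which adds refinements and heuristics on top. The function name myers_distance is made up.

```python
def myers_distance(a, b):
    """Length of the shortest edit script (insertions + deletions) turning a into b."""
    n, m = len(a), len(b)
    # furthest[k] = largest x reached so far on diagonal k, where k = x - y.
    furthest = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Decide whether to step down (insert from b) or right (delete from a).
            if k == -d or (k != d and furthest[k - 1] < furthest[k + 1]):
                x = furthest[k + 1]
            else:
                x = furthest[k - 1] + 1
            y = x - k
            # Follow the "snake": matching elements cost nothing.
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            furthest[k] = x
            if x >= n and y >= m:
                return d
    return n + m

# The example pair of sequences used in Myers' paper.
print(myers_distance("ABCABBA", "CBABAC"))
```

The search grows d one step at a time and stops at the first d that reaches the end of both inputs, which is why the result is minimal and why the running time depends on how different the inputs are.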
