Git Diff of same files in two directories always results in "renamed"


Problem description

      git diff --no-index --no-prefix --summary -U4000 directory1 directory2

      This works as expected in that it returns a diff of all the files between the two directories. Added files are output as expected, and deleted files also produce the expected diff output.

      However because the diff takes into account the file path as part of the file name, files with the same name, in the two different directories, result in a diff output with the renamed flag instead of changed.

      1. Is there a way to tell git to not take into account the full file path in the diff and only look at the file name, as if the files were originating from the same directory?

      2. Is there a way for git to actually know if a copy of the same file in a different directory was actually renamed? I don't see how, unless it has a way of comparing the files md5s somehow or something (probably a bad guess lol).

      3. Would using branches instead of directories resolve this issue easily and if so what is the branch version of the command listed above?

      Solution

      There are multiple questions here, whose answers intertwine. Let's start with rename and copy detection, then move on to branches.

      Rename detection

      However because the diff takes into account the file path as part of the file name, files with the same name, in the two different directories, result in a diff output with the renamed flag instead of changed.

      This is not quite right. (The text below is meant to address both your items 1 and 2.)

      Although you are using --no-index (presumably, to make Git work on directories outside the repository), Git's diff code behaves the same way in all cases. In order to diff (compare) two files in two trees, Git must first determine file identity. That is, there are two sets of files: those in the "left side" or source tree (the first directory name), and those in the "right side" or destination tree (the second directory name). Some files on the left are the same file as some files on the right. Some files on the left are different files that have no corresponding right-side file, i.e., they have been deleted. Finally, some files on the right side are new, i.e., they have been created.

      Files that are "the same file" need not have the same path name. In this case, those files have been renamed.

      Here's how it works in detail. Note that "full path name" is modified somewhat when using git diff --no-index dir1 dir2: the "full path name" is what is left after stripping off the dir1 and dir2 prefixes.

      When comparing the left and right side trees, files that have the same full path names are normally automatically considered "the same file". We place all these files into a queue of "files to be diffed", and none will show up as being renamed. Note the word "normally" here—we'll come back to this in a moment.

      This leaves us with two remaining lists of files:

      • paths that exist on the left, but not the right: source without destination
      • paths that exist on the right, but not the left: destination without source
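The first pass described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Git's code: the function names `strip_root` and `partition_paths` and the sample file names are made up for this example.

```python
from pathlib import PurePosixPath

def strip_root(paths, root):
    """Drop the leading directory (e.g. 'directory1') from each path."""
    return {PurePosixPath(p).relative_to(root).as_posix() for p in paths}

def partition_paths(left_paths, right_paths, left_root, right_root):
    """Split paths into same-name pairs, unpaired sources, unpaired destinations."""
    left = strip_root(left_paths, left_root)
    right = strip_root(right_paths, right_root)
    return sorted(left & right), sorted(left - right), sorted(right - left)

# Hypothetical directory listings, for illustration only.
paired, sources_only, destinations_only = partition_paths(
    ["directory1/a/b/c.txt", "directory1/old.txt"],
    ["directory2/a/b/c.txt", "directory2/new.txt"],
    "directory1",
    "directory2",
)
print(paired)             # ['a/b/c.txt']
print(sources_only)       # ['old.txt']
print(destinations_only)  # ['new.txt']
```

Only the two leftover lists are candidates for rename detection; the same-name pairs go straight to the diff queue.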

      Naïvely, we can simply declare that all of these source-side files have been deleted, and all of these destination files have been created. You can instruct git diff to behave this way: set the --no-renames flag to disable rename detection.

      Or, Git can go on to use a smarter algorithm: set the --find-renames and/or -M <threshold> flag to do this. In Git versions 2.9 and later, rename detection is on by default.

      Now, how shall Git decide that a source file has the same identity as a destination file? They have different paths; which right-side file does a/b/c.txt on the left correspond to? It might be d/e/f.bin, or d/e/f.txt, or a/b/renamed.txt, and so on. The actual algorithm is relatively simple, and in the past did not take the final name component into account (I'm not sure whether it does now; Git is constantly evolving):

      • If there are source and destination files whose contents match exactly, pair them. Because Git hashes contents, this comparison is very fast. We can compare left-side a/b/c.txt by its hash ID to every file on the right, simply by looking at all of their hash IDs. Therefore, we run through all source files first, finding destination files that match, putting the new pairs into the diff queue and pulling them out of the two lists.

      • For all remaining source and destination files, run an algorithm that is efficient, but not suitable for producing git diff output, to compute "file similarity". A source file that is at least <threshold> similar to some destination file causes a pairing, and that file-pair is removed from the lists. The default threshold is 50%: if you have enabled rename detection without choosing a particular threshold, two files that are still in the lists by this point, and are at least 50% similar, get paired.

      • Any remaining files are either deleted or created.
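The three steps above can be sketched roughly as follows. This is a simplified model, not Git's implementation: it hashes with SHA-1 over the raw text (Git hashes blob objects), and it uses `difflib.SequenceMatcher` on lines as a stand-in for Git's own similarity estimate. The names `detect_renames`, `content_id` and `similarity`, and the sample files, are assumptions made for this example.

```python
import hashlib
from difflib import SequenceMatcher

def content_id(text):
    """Stand-in for a content hash ID (Git really hashes blob objects)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def similarity(a, b):
    """Line-based similarity in [0, 1]; a stand-in for Git's own estimate."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()

def detect_renames(sources, destinations, threshold=0.5):
    """sources/destinations map unpaired path -> text content."""
    sources, destinations = dict(sources), dict(destinations)
    renames = []

    # Step 1: pair files whose contents match exactly, using hashes only.
    by_hash = {}
    for path in sorted(destinations):
        by_hash.setdefault(content_id(destinations[path]), []).append(path)
    for src in sorted(sources):
        candidates = by_hash.get(content_id(sources[src]), [])
        if candidates:
            dst = candidates.pop(0)
            renames.append((src, dst, 1.0))
            del sources[src]
            del destinations[dst]

    # Step 2: pair the rest by similarity, best score first.
    scored = sorted(
        ((similarity(s_text, d_text), src, dst)
         for src, s_text in sources.items()
         for dst, d_text in destinations.items()),
        reverse=True,
    )
    for score, src, dst in scored:
        if score < threshold:
            break
        if src in sources and dst in destinations:
            renames.append((src, dst, score))
            del sources[src]
            del destinations[dst]

    # Step 3: whatever is left is deleted (sources) or created (destinations).
    return renames, sorted(sources), sorted(destinations)

# Hypothetical file contents, for illustration only.
renames, deleted, created = detect_renames(
    {"a/b/c.txt": "one\ntwo\nthree\n",
     "docs/old.txt": "alpha\nbeta\ngamma\ndelta\n",
     "gone.txt": "unrelated\n"},
    {"d/e/f.txt": "one\ntwo\nthree\n",
     "docs/new.txt": "alpha\nbeta\ngamma\nDELTA\n",
     "fresh.txt": "brand new\n"},
)
print(renames)
print(deleted, created)
```

The exact-match step is cheap because it compares hashes only; the similarity step compares every remaining source with every remaining destination, which is why its cost grows quickly with the number of unpaired files.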

      Now that we have found all pairings, git diff proceeds to diff the paired, same-identity files, and tells us that deleted files are deleted, and newly-created files are created. If the two path names for same-identity files differ, git diff says the file is renamed.

      The arbitrary-file-pairing code is expensive (even though the same-name-gives-a-pair code is very cheap), so Git has a limit on how many names go into these pairing source and destination lists. That limit is configured through git config diff.renameLimit. The default has climbed over the years and is now several thousand files. You can set it to 0 (zero) to make Git use its own internal maximum at all times.

      Breaking pairs

      Above, I said that normally, files with the same name are paired automatically. This is usually the right thing to do, so it is Git's default. In some cases, however, the left-side file that is named a/b/c.txt is actually not related to the right-side file named a/b/c.txt, it's really related to the right-side a/doc/c.txt for instance. We can tell Git to break pairings of files that are "too different".

      We saw the "similarity index" used above to form pairings of files. This same similarity index can be used to split files: -B20%/60%, for instance. The two numbers need not add up to 100% and you can actually omit either one, or both: there's a default value for each if you set -B mode.

      The first number is the point at which a default-already-paired file can be put into the rename detection lists. With -B20%, if the files are 20% dissimilar (i.e., only 80% similar), the file goes into the "source for renames" list. If it never gets taken as a rename, it can re-pair with its automatic destination, but at this point the second number, the one after the slash, takes effect.

      The second number sets the point at which a pairing is definitely broken. With -B/70%, for instance, if the files are 70% dissimilar (i.e., only 30% similar), the pairing is broken. (Of course, if the file was taken away as a rename source, the pairing is already broken.)
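As a rough model of how the two -B numbers interact, the following sketch classifies a same-name pair by its dissimilarity. The function name `classify_pair` and its return labels are invented for this illustration, and the thresholds are passed explicitly rather than assuming Git's defaults.

```python
def classify_pair(dissimilarity, rename_point, break_point):
    """Classify a same-name pair given its dissimilarity (0.0 to 1.0).

    rename_point: first -B number, as a fraction (0.20 for -B20%).
    break_point:  second -B number, as a fraction (0.60 for -B20%/60%).
    """
    if dissimilarity >= break_point:
        # Too different: shown as a deletion plus a creation.
        return "broken"
    if dissimilarity >= rename_point:
        # May be used as a rename source; re-pairs if no rename takes it.
        return "rename-candidate"
    # Similar enough: stays paired with its same-name partner.
    return "paired"

# Modelling -B20%/60% with three hypothetical dissimilarity values.
for value in (0.10, 0.35, 0.70):
    print(value, classify_pair(value, rename_point=0.20, break_point=0.60))
```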

      Copy detection

      Besides the usual pairing and rename detection, you can ask Git to find copies of source files. After running all the usual pairing code, including finding renames and breaking pairs, if you have specified -C, Git will look for "new" (i.e., unpaired) destination files that are actually copied from existing sources. There are two modes for this, depending on whether you specify -C twice or add --find-copies-harder: one considers only source files that are modified (that's the single -C case), and the other considers every source file (that's the two -C or --find-copies-harder case). Note that "was a source file modified" means, in this case, that the source file is already in the paired queue (if not, it's not "modified" by definition) and that its corresponding destination file has a different hash ID. Again, this is a very low-cost test, which helps keep a single -C option cheap.
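The difference between the two copy-detection modes can be sketched like this. It is a deliberately simplified model: it only finds exact copies, whereas Git also uses the similarity score, and the names `copy_candidates` and `find_copies` and the sample files are assumptions made for this example.

```python
def copy_candidates(left, right, harder=False):
    """Source paths that may be treated as copy sources.

    left/right map path -> content. With harder=False (single -C) only
    sources that are paired and modified qualify; with harder=True
    (-C -C or --find-copies-harder) every source file qualifies.
    """
    candidates = []
    for path, content in left.items():
        modified = path in right and right[path] != content
        if harder or modified:
            candidates.append(path)
    return sorted(candidates)

def find_copies(left, right, harder=False):
    """Match new destination files against candidate sources (exact content only)."""
    sources = copy_candidates(left, right, harder)
    copies = []
    for dst in sorted(set(right) - set(left)):
        for src in sources:
            if left[src] == right[dst]:
                copies.append((src, dst))
                break
    return copies

# Hypothetical trees, for illustration only.
left = {"lib.py": "v1", "util.py": "helpers"}
right = {"lib.py": "v2", "util.py": "helpers",
         "lib_copy.py": "v1", "util_copy.py": "helpers"}

print(find_copies(left, right))               # only the modified source is considered
print(find_copies(left, right, harder=True))  # unmodified sources are considered too
```

In this example the copy of the unmodified `util.py` is found only in the "harder" mode, which is the trade-off the text describes: the single -C mode is cheaper because it looks at fewer sources.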

      Branches don't matter

      Would using branches instead of directories resolve this issue easily and if so what is the branch version of the command listed above?

      Branches make no difference here.

      In Git, the term branch is ambiguous. See What exactly do we mean by "branch"? For git diff, though, a branch name simply resolves to a single commit, namely the tip commit of that branch.

      I like to draw Git's branches like this:

      ...--o--o--o   <-- branch1
               \
                o--o--o   <-- branch2
      

      Each small round o represents a commit. The two branch names are simply pointers, in Git: they point to one specific commit. The name branch1 points to the rightmost commit on the top line, and the name branch2 points to the rightmost commit on the bottom line.

      Each commit, in Git, points back to its parent or parents (most commits have just one parent, while a merge commit is simply a commit with two or more parents). This is what forms the chain of commits that we also call "a branch". The branch name points directly to the tip of a chain.[1]

      When you run:

      $ git diff branch1 branch2
      

      all that Git does is resolve each name to its corresponding commit. For instance, if branch1 names commit 1234567... and branch2 names commit 89abcde..., this just does the same thing as:

      $ git diff 1234567 89abcde
      

      Git's diff takes two trees

      Git does not even care that these are commits, really. Git just needs a left side or source tree, and a right side or destination tree. These two trees can come from a commit, because a commit names a tree: the tree of any commit is the source snapshot taken when you made that commit. They can come from a branch, because a branch-name names a commit, which names a tree. One of the trees can come from Git's "index" (aka "staging area" aka "cache"), as the index is basically a flattened tree.[2] One of the trees can be your work-tree. One or both trees can even be outside of Git's control (hence the --no-index flag).
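The chain "branch name names a commit, which names a tree" can be modelled with two lookup tables. Everything here is a toy model: the object IDs, the tree labels and the function `resolve_to_tree` are made up for illustration and do not correspond to real Git objects or APIs.

```python
# Hypothetical IDs and tree labels, made up for illustration only.
branches = {"branch1": "1234567", "branch2": "89abcde"}
commits = {
    "1234567": {"tree": "tree-A"},
    "89abcde": {"tree": "tree-B"},
}

def resolve_to_tree(name_or_id):
    """Resolve a branch name (or a commit ID) to the tree it names."""
    commit_id = branches.get(name_or_id, name_or_id)
    return commits[commit_id]["tree"]

# Diffing by branch name or by commit ID compares the same two trees.
print(resolve_to_tree("branch1"), resolve_to_tree("branch2"))
print(resolve_to_tree("1234567"), resolve_to_tree("89abcde"))
```

Since both spellings end at the same pair of trees, the pairing and rename detection described earlier behave identically either way.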

      Of course, Git can just diff two files

      If you run git diff --no-index /path/to/file1 /path/to/file2, Git will simply diff the two files, i.e., treat them as a pair. This bypasses all the pairing and rename-detecting code entirely. If no amount of fiddling with options such as --no-renames, --find-renames (or -M, optionally with a threshold), and -B does the trick, you can explicitly diff file paths, rather than directory (tree) paths. For a large set of files, this will, of course, be painful.


      [1] There can be more commits past that point, but it's still the tip of its chain. Moreover, multiple names can point to a single commit. I draw this situation as:

      ...--o--o   <-- tip1
               \
                o--o   <-- tip2, tip3
      

      Note that commits that are "behind" more than one branch name are, in fact, on all of those branches. So both bottom-row commits are on both tip2 and tip3 branches, while both top-row commits are on all three branches. Nonetheless, each branch name resolves to one, and only one, commit.
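The "on all of those branches" rule is a reachability question, which a short sketch can make concrete. The commit labels A to D are invented to stand for the four commits in the drawing (A and B on the top row, C and D on the bottom row), and the history hidden behind the "..." is ignored, so A is treated as having no parent.

```python
# Hypothetical labels for the four drawn commits; A's real parents are elided.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
tips = {"tip1": "B", "tip2": "D", "tip3": "D"}

def reachable(commit):
    """All commits reachable from `commit` by following parent links."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def branches_containing(commit):
    """Names of the branches whose tip can reach `commit`."""
    return sorted(name for name, tip in tips.items() if commit in reachable(tip))

print(branches_containing("A"))  # top-row commit: on all three branches
print(branches_containing("C"))  # bottom-row commit: on tip2 and tip3 only
```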

      [2] In fact, to make a new commit, Git simply converts the index, just as it stands right now, into a tree using git write-tree, and then makes a commit that names that tree (and that uses the current commit as its parent, and has an author and committer, and a commit message). The fact that Git uses the existing index is why you must git add your updated work-tree files into the index before committing.

      There are some convenience short-cuts that let you tell git commit to add files to the index, e.g., git commit -a or git commit <path>. These can be a bit tricky as they don't always produce the index you might expect. See the --include vs --only options to git commit <path>, for instance. They also work by copying the main index to a new, temporary index; and this can have surprising results, because if the commit succeeds, the temporary index is copied back over the regular index.
