Git Diff of same files in two directories always result in "renamed"

Problem description

git diff --no-index --no-prefix --summary -U4000 directory1 directory2

This works as expected in that it returns a diff of all the files between the two directories. Files that are added output as expected, files that are deleted also result in the expected diff output.

However because the diff takes into account the file path as part of the file name, files with the same name, in the two different directories, result in a diff output with the renamed flag instead of changed.

  1. Is there a way to tell git to not take into account the full file path in the diff and only look at the file name, as if the files were originating from the same directory?

  2. Is there a way for git to actually know if a copy of the same file in a different directory was actually renamed? I don't see how, unless it has a way of comparing the files md5s somehow or something (probably a bad guess lol).

  3. Would using branches instead of directories resolve this issue easily and if so what is the branch version of the command listed above?

Recommended answer

There are multiple questions here, whose answers intertwine. Let's start with rename and copy detection, then move on to branches.

However because the diff takes into account the file path as part of the file name, files with the same name, in the two different directories, result in a diff output with the renamed flag instead of changed.

This is not quite right. (The text below is meant to address both your items 1 and 2.)

Although you are using --no-index (presumably, to make Git work on directories outside the repository), Git's diff code behaves the same way in all cases. In order to diff (compare) two files in two trees, Git must first determine file identity. That is, there are two sets of files: those in the "left side" or source tree (the first directory name), and those in the "right side" or destination tree (the second directory name). Some files on the left are the same file as some files on the right. Some files on the left are different files that have no corresponding right-side file, i.e., they have been deleted. Finally, some files on the right side are new, i.e., they have been created.

Files that are "the same file" need not have the same path name. In this case, those files have been renamed.

Here's how it works in detail. Note that "full path name" is modified somewhat when using git diff --no-index dir1 dir2: the "full path name" is what is left after stripping off the dir1 and dir2 prefixes.

When comparing the left and right side trees, files that have the same full path names are normally automatically considered "the same file". We place all these files into a queue of "files to be diffed", and none will show up as being renamed. Note the word "normally" here—we'll come back to this in a moment.

This leaves us with two remaining lists of files:

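  • paths that exist on the left but not on the right: sources with no destination
  • paths that exist on the right but not on the left: destinations with no source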

Naïvely, we can simply declare that all of these source-side files have been deleted, and all of these destination files have been created. You can instruct git diff to behave this way: set the --no-renames flag to disable rename detection.

Or, Git can go on to use a smarter algorithm: set the --find-renames and/or -M <threshold> flag to do this. In Git versions 2.9 and later, rename detection is on by default.
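As a rough illustration of the two modes, the sketch below builds a throwaway repository and compares two commits rather than two directories, on the assumption stated above that the pairing logic is the same. The file names and the `commit` helper are made up for this example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical helper: commit with a fixed identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

printf 'line one\nline two\nline three\n' > old.txt
git add old.txt
commit -m 'add old.txt'
git mv old.txt new.txt
commit -m 'move old.txt to new.txt'

# Rename detection off: the move shows up as one deletion plus one addition.
without=$(git diff --no-renames --name-status HEAD~1 HEAD)
# Rename detection on: the identical contents are paired and reported as a rename.
with=$(git diff --find-renames --name-status HEAD~1 HEAD)

printf 'without rename detection:\n%s\nwith rename detection:\n%s\n' "$without" "$with"
```

The same two flags can be added to the `git diff --no-index ... directory1 directory2` command from the question.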

Now, how shall Git decide that a source file has the same identity as a destination file? They have different paths; which right-side file does a/b/c.txt on the left correspond to? It might be d/e/f.bin, or d/e/f.txt, or a/b/renamed.txt, and so on. The actual algorithm is relatively simple, and in the past did not take the final name component into account (I'm not sure if it does now, Git is constantly evolving):

  • If there are source and destination files whose contents match exactly, pair them. Because Git hashes contents, this comparison is very fast. We can compare left-side a/b/c.txt by its hash ID to every file on the right, simply by looking at all of their hash IDs. Therefore, we run through all source files first, finding destination files that match, putting the new pairs into the diff queue and pulling them out of the two lists.

  • For all remaining source and destination files, run an efficient, but unsuitable for git diff output, algorithm to compute "file similarity". A source file that is at least <threshold> similar to some destination file causes a pairing, and that file-pair is removed. The default threshold is 50%: if you have enabled rename detection without choosing a particular threshold, two files that are still in the lists by this point, and are 50% similar, get paired.

  • Any remaining files are either deleted or created.
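The steps above can be watched in a small experiment. This is only a sketch with invented file names and a hypothetical `commit` helper: one file is moved unchanged (the exact-match step), another is moved and edited (the similarity step), and `-M100%` is used to accept exact-content pairs only.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical helper: commit with a fixed identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

# a.txt: ten distinct lines; c.txt: unrelated content.
for i in 1 2 3 4 5 6 7 8 9 10; do
  echo "this is line number $i of the file"
done > a.txt
printf 'completely unrelated content\nsecond unrelated line\n' > c.txt
git add a.txt c.txt
commit -m 'initial files'

# Move c.txt unchanged; move a.txt and append one line to it.
git mv c.txt d.txt
git mv a.txt b.txt
echo "one extra line appended after the move" >> b.txt
git add b.txt
commit -m 'move both files, edit one'

# Default threshold (50%): both moves are paired, one exactly, one by similarity.
fuzzy=$(git diff -M --name-status HEAD~1 HEAD)
# -M100%: only exact-content pairs survive; the edited file becomes delete + add.
exact_only=$(git diff -M100% --name-status HEAD~1 HEAD)

printf 'default threshold:\n%s\nexact only:\n%s\n' "$fuzzy" "$exact_only"
```

The number after the `R` in the output is the similarity index, so the unchanged file should show `R100` and the edited one a lower score.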

Now that we have found all pairings, git diff proceeds to diff the paired, same-identity files, and tells us that deleted files are deleted, and newly-created files are created. If the two path names for same-identity files differ, git diff says the file is renamed.

The arbitrary-file-pairing code is expensive (even though the same-name-gives-a-pair code is very cheap), so Git has a limit on how many names go into these pairing source and destination lists. That limit is configured through git config diff.renameLimit. The default has climbed over the years and is now several thousand files. You can set it to 0 (zero) to make Git use its own internal maximum at all times.
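A minimal sketch of setting the limit, assuming a scratch repository so that no real configuration is touched (the value 5000 in the comments is an arbitrary example):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Persist the setting for this repository only; 0 asks Git to use its internal maximum.
git config diff.renameLimit 0
configured=$(git config --get diff.renameLimit)
echo "diff.renameLimit is now: $configured"

# One-off override for a single command, leaving the configuration alone:
#   git -c diff.renameLimit=5000 diff -M --summary <old> <new>
# or the per-command option:
#   git diff -M -l5000 --summary <old> <new>
```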

Above, I said that normally, files with the same name are paired automatically. This is usually the right thing to do, so it is Git's default. In some cases, however, the left-side file that is named a/b/c.txt is actually not related to the right-side file named a/b/c.txt, it's really related to the right-side a/doc/c.txt for instance. We can tell Git to break pairings of files that are "too different".

We saw the "similarity index" used above to form pairings of files. This same similarity index can be used to split files: -B20%/60%, for instance. The two numbers need not add up to 100% and you can actually omit either one, or both: there's a default value for each if you set -B mode.

The first number is the point at which a default-already-paired file can be put into the rename detection lists. With -B20%, if the files are 20% dis-similar (i.e., only 80% similar), the file goes into the "source for renames" list. If it never gets taken as a rename, it can re-pair with its automatic destination—but at this point, the second number, the one after the slash, takes effect.

The second number sets the point at which a pairing is definitely broken. With -B/70%, for instance, if the files are 70% dis-similar (i.e., only 30% similar), the pairing is broken. (Of course, if the file was taken away as a rename source, the pairing is already broken.)

Besides the usual pairing and rename detection, you can ask Git to find copies of source files. After running all the usual pairing code, including finding renames and breaking pairs, if you have specified -C, Git will look for "new" (i.e., unpaired) destination files that are actually copied from existing sources. There are two modes for this, depending on whether you specify -C twice or add --find-copies-harder: one considers only source files that are modified (that's the single -C case), and one that considers every source file (that's the two -C or --find-copies-harder case). Note that this "was a source file modified" means, in this case, that the source file is already in the paired queue—if not, it's not "modified" by definition—and its corresponding destination file has a different hash ID (again, this is a very low-cost test, which helps keep a single -C option cheap).
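The difference between the two copy modes can be checked with a small sketch (invented file names, hypothetical `commit` helper). The source file a.txt is left unmodified, so a single -C should not consider it, while --find-copies-harder should.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical helper: commit with a fixed identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

printf 'alpha\nbeta\ngamma\n' > a.txt
git add a.txt
commit -m 'add a.txt'

# Copy the file under a new name and leave the original untouched.
cp a.txt copy.txt
git add copy.txt
commit -m 'add copy.txt'

# Single -C: only modified files are copy sources, so copy.txt is just "added".
single=$(git diff -C --name-status HEAD~1 HEAD)
# --find-copies-harder: unmodified files are candidate sources too.
harder=$(git diff -C --find-copies-harder --name-status HEAD~1 HEAD)

printf 'single -C:\n%s\nfind-copies-harder:\n%s\n' "$single" "$harder"
```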

Would using branches instead of directories resolve this issue easily and if so what is the branch version of the command listed above?

Branches make no difference here.

In Git, the term branch is ambiguous. See What exactly do we mean by "branch"? For git diff, though, a branch name simply resolves to a single commit, namely the tip commit of that branch.

I like to draw Git's branches like this:

...--o--o--o   <-- branch1
         \
          o--o--o   <-- branch2

The small round os each represent a commit. The two branch names are simply pointers, in Git: they point to one specific commit. The name branch1 points to the rightmost commit on the top line, and the name branch2 points to the rightmost commit on the bottom line.

Each commit, in Git, points back to its parent or parents (most commits have just one parent, while a merge commit is simply a commit with two or more parents). This is what forms the chain of commits that we also call "a branch". The branch name points directly to the tip of a chain.[1]

When you run:

$ git diff branch1 branch2

all that Git does is resolve each name to its corresponding commit. For instance, if branch1 names commit 1234567... and branch2 names commit 89abcde..., this just does the same thing as:

$ git diff 1234567 89abcde
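The equivalence of branch names and commit IDs is easy to confirm in a scratch repository. This is a sketch only: the branch names follow the example above, and the `commit` helper and file name are invented.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical helper: commit with a fixed identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

echo 'first' > file.txt
git add file.txt
commit -m 'first'
git branch branch1          # branch1 points at the first commit

echo 'second' >> file.txt
git add file.txt
commit -m 'second'
git branch branch2          # branch2 points at the second commit

# Resolve each name to the commit it points to.
id1=$(git rev-parse branch1)
id2=$(git rev-parse branch2)

# Diffing by name and diffing by resolved ID give the same output.
by_name=$(git diff branch1 branch2)
by_id=$(git diff "$id1" "$id2")

echo "branch1 -> $id1"
echo "branch2 -> $id2"
```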

Git's diff takes two trees

Git does not even care that these are commits, really. Git just needs a left side or source tree, and a right side or destination tree. These two trees can come from a commit, because a commit names a tree: the tree of any commit is the source snapshot taken when you made that commit. They can come from a branch, because a branch-name names a commit, which names a tree. One of the trees can come from Git's "index" (aka "staging area" aka "cache"), as the index is basically a flattened tree.[2] One of the trees can be your work-tree. One or both trees can even be outside of Git's control (hence the --no-index flag).
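A short sketch of the different places a tree can come from, again in a scratch repository with an invented file name and a hypothetical `commit` helper:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical helper: commit with a fixed identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

echo 'v1' > file.txt
git add file.txt
commit -m 'v1'
echo 'v2' >> file.txt
git add file.txt
commit -m 'v2'

# Commits versus the tree objects they name: the diff is the same.
by_commit=$(git diff HEAD~1 HEAD)
by_tree=$(git diff 'HEAD~1^{tree}' 'HEAD^{tree}')

# The index as one side: staged changes compared with HEAD.
echo 'staged' >> file.txt
git add file.txt
staged=$(git diff --cached --name-status)

# The work-tree as one side: unstaged changes compared with the index.
echo 'unstaged' >> file.txt
unstaged=$(git diff --name-status)

printf 'staged: %s\nunstaged: %s\n' "$staged" "$unstaged"
```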

If you run git diff --no-index /path/to/file1 /path/to/file2, Git will simply diff the two files, i.e., treat them as a pair. This bypasses all the pairing and rename-detecting code entirely. If no amount of fiddling with --no-renames, --find-renames=<threshold>, -B, etc., options does the trick, you can explicitly diff file paths, rather than directory (tree) paths. For a large set of files, this will, of course, be painful.
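The pain can be reduced with a loop. The sketch below is one possible approach, not a built-in Git feature: it assumes flat directories (no subdirectories) and uses made-up sample files, pairing files by name and diffing each pair explicitly.

```shell
set -e
work=$(mktemp -d)
cd "$work"
mkdir directory1 directory2
printf 'hello\nworld\n' > directory1/notes.txt
printf 'hello\nchanged world\n' > directory2/notes.txt
printf 'only on the left\n' > directory1/left-only.txt

out=""
for left in directory1/*; do
  name=$(basename "$left")
  right="directory2/$name"
  # Names without a counterpart are pure additions or deletions; skip them here.
  [ -f "$right" ] || continue
  # git diff --no-index exits non-zero when the files differ, so guard against set -e.
  piece=$(git diff --no-index --no-prefix -- "$left" "$right" || true)
  out="$out$piece"
done

printf '%s\n' "$out"
```

Files that exist on only one side still need separate handling, for example by listing them with a directory-level diff.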

[1] There can be more commits past that point, but it's still the tip of its chain. Moreover, multiple names can point to a single commit. I draw this situation as:

...--o--o   <-- tip1
         \
          o--o   <-- tip2, tip3

Note that commits that are "behind" more than one branch name are, in fact, on all of those branches. So both bottom-row commits are on both tip2 and tip3 branches, while both top-row commits are on all three branches. Nonetheless, each branch name resolves to one, and only one, commit.

[2] In fact, to make a new commit, Git simply converts the index, just as it stands right now, into a tree using git write-tree, and then makes a commit that names that tree (and that uses the current commit as its parent, and has an author and committer, and a commit message). The fact that Git uses the existing index is why you must git add your updated work-tree files into the index before committing.
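The relationship between the index and the committed tree can be verified directly. This is a minimal sketch with an invented file name and a hypothetical `commit` helper:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Hypothetical helper: commit with a fixed identity so no global config is needed.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q "$@"; }

echo 'content' > file.txt
git add file.txt

# The tree the index would become if it were committed right now.
index_tree=$(git write-tree)

commit -m 'snapshot'

# The tree recorded by the commit that was just made.
commit_tree=$(git rev-parse 'HEAD^{tree}')

echo "index tree:  $index_tree"
echo "commit tree: $commit_tree"
```

Both lines should print the same tree ID, since the commit simply names the tree written from the index.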

There are some convenience short-cuts that let you tell git commit to add files to the index, e.g., git commit -a or git commit <path>. These can be a bit tricky as they don't always produce the index you might expect. See the --include vs --only options to git commit <path>, for instance. They also work by copying the main index to a new, temporary index; and this can have surprising results, because if the commit succeeds, the temporary index is copied back over the regular index.
