How does git log --follow <filename> work?


Question

I'm trying to pick an ID for a file's history. I'd like it to be, or refer to, the "object" whose details git log --follow <filename> shows. I'm wondering:

How does Git know that one file is a variant of another across subsequent commits? The name being the same is a strong hint, of course, but it also tracks renames across commits. Does it keep the results of its calculations somewhere for git log to refer to (and if so, where?), or does git log repeat these calculations every time? (And what calculations are these?)

Ideally, I'd like to access or recreate the history (the list of commits and blob SHAs) with nodegit.

Recommended answer

Other people and I have described this in varying detail elsewhere, e.g., in the answer to "What's git's heuristic for assigning content modifications to file paths?" and in my answer to "Git Diff of same files in two directories always result in "renamed"". The details are slightly different for git log --follow than they are for git diff, since git diff usually deals with an entire tree-full of left-side and right-side files, while git log --follow works with only one particular path. [1]

In any case, rename-following happens when comparing two specific commits. For a general git diff they are any two commits R (right side) and L (left side); you choose the two. [2] For git log they are specifically parent and child; let's call these P and C for convenience. With git log --follow, Git runs an after-diff step (called from diff_tree_sha1; see footnotes) that trims everything down to one file. The diff is done with R=C and L=P. The general case is actually easier to describe, though, so we'll start with that.

Normally, when comparing R vs L, Git:

  • matches up all the files in the trees that have the same full path name, then
  • puts the remaining files (paths) into a pairing queue.

You can modify this a bit with the -B (pair-breaking) flag, which actually takes two optional integers (-Bn/m). Only the n integer matters for rename detection. [3] You can modify it with the -C flag as well; this takes only an optional n, and turns on copy detection. In all cases rename detection must be turned on. Rename detection is enabled via -M, which likewise takes an optional integer n, or automatically in the case of git log --follow and other commands like git status or the post-merge git diff --stat.

In any case, the integer n here is a similarity (or dissimilarity) metric value for all of these various options. This is where we get to the meat of the rename detection code.

Suppose that we have, to start with, a basic git diff <commit1> <commit2> or git diff <tree1> <tree2> operation. This winds up calling builtin_diff_tree in builtin/diff.c, which calls diff_tree_sha1 (which we'll see again later) and then log_tree_diff_flush in log-tree.c. This almost immediately calls diffcore_std in diff.c, which runs the diffcore_break, diffcore_rename, and diffcore_merge_broken functions if the right options (-B, then -M or -C, then -B again) are selected.

These three functions operate on a pairing queue. How does the pairing queue get set up? I'll leave that to another section since it's complicated. For now, just assume that the pairing queue already has path/to/file matched with path/to/file when there's a path/to/file in both L and R, and otherwise has an unpaired path/to/L-only or path/to/R-only for cases where a file path occurs only in L or only in R.
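The assumed starting state of the pairing queue can be illustrated with a minimal Python sketch. This is not Git's code: the function name build_pairing_queue, the tuple layout, and the use of None for the missing side are all hypothetical, and as a simplification same-named files are queued only when their blob hashes differ.

```python
def build_pairing_queue(left, right):
    """Build a toy pairing queue from two trees.

    left and right map a full path to a blob hash (hypothetical layout).
    Each queue entry is (l_path, r_path); None marks a missing side.
    """
    queue = []
    for path in sorted(set(left) | set(right)):
        in_left, in_right = path in left, path in right
        if in_left and in_right:
            # Same full path on both sides: already paired by name.
            # Simplification: only queue it if the contents differ.
            if left[path] != right[path]:
                queue.append((path, path))
        elif in_left:
            queue.append((path, None))   # exists only in L
        else:
            queue.append((None, path))   # exists only in R
    return queue


left = {"README": "a1", "src/old.c": "b2", "Makefile": "c3"}
right = {"README": "a1", "src/new.c": "b2", "Makefile": "c4"}
queue = build_pairing_queue(left, right)
```

Here src/old.c and src/new.c end up as two unpaired entries, which is exactly the material that rename detection then tries to match up.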

The diffcore_break function is in diffcore-break.c. Its job is to find already-paired files whose dissimilarity index (when comparing the L and R versions) is above some threshold. If that's the case, it breaks the pairing. The diffcore_merge_broken function is just below it in the same file; it rejoins a broken-up pair if neither half has found a "better mate". The dissimilarity index computation is similar to, but not the same as, the similarity computation. [4]
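The break-then-rejoin idea can be sketched in a few lines of Python. The names break_pairs and merge_broken, the precomputed dissimilarity table, and the threshold value are all hypothetical; the real code computes the dissimilarity itself and tracks much more state.

```python
def break_pairs(pairs, dissimilarity, threshold):
    """Split same-name pairs whose dissimilarity exceeds the threshold.

    pairs: list of (l_path, r_path) with l_path == r_path.
    dissimilarity: path -> precomputed score (hypothetical input).
    Returns (kept_pairs, broken_paths).
    """
    kept, broken = [], []
    for l_path, r_path in pairs:
        if dissimilarity[l_path] > threshold:
            broken.append(l_path)  # now an unpaired source and destination
        else:
            kept.append((l_path, r_path))
    return kept, broken


def merge_broken(broken, renamed_sources, renamed_destinations):
    """Rejoin a broken pair if neither half found a better mate."""
    return [(p, p) for p in broken
            if p not in renamed_sources and p not in renamed_destinations]


pairs = [("a.c", "a.c"), ("b.c", "b.c"), ("c.c", "c.c")]
dissimilarity = {"a.c": 10, "b.c": 90, "c.c": 95}
kept, broken = break_pairs(pairs, dissimilarity, threshold=50)
# Suppose rename detection then used the old b.c as the source of a rename.
rejoined = merge_broken(broken, renamed_sources={"b.c"},
                        renamed_destinations=set())
```

In this toy run a.c stays paired, b.c stays broken because its old half was claimed by a rename, and c.c is glued back together.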

The more interesting diffcore_rename function is in diffcore-rename.c. It has a special-case shortcut for --follow that we can ignore for now. It then looks for exact renames, i.e., files whose blob hashes match even though their names don't. There are also some fiddly bits for using "the next file" if multiple L sources have the same hash as some unpaired R destination.
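Exact-rename matching can be sketched as a lookup by blob hash. This is an illustration under stated assumptions, not Git's implementation: find_exact_renames and its two input dictionaries are hypothetical, and "the next file" is modeled simply as taking the first not-yet-used source with the same hash.

```python
def find_exact_renames(deleted, added):
    """Pair L-only and R-only files whose blob hashes are identical.

    deleted: path -> blob hash for files present only in L.
    added:   path -> blob hash for files present only in R.
    Returns a list of (source_path, destination_path).
    """
    sources_by_hash = {}
    for path in sorted(deleted):
        sources_by_hash.setdefault(deleted[path], []).append(path)

    renames = []
    for dst in sorted(added):
        candidates = sources_by_hash.get(added[dst])
        if candidates:
            # Use "the next file" with this hash so each source is used once.
            renames.append((candidates.pop(0), dst))
    return renames


deleted = {"a/one.txt": "h1", "a/two.txt": "h1", "a/three.txt": "h2"}
added = {"b/one.txt": "h1", "b/two.txt": "h1", "b/four.txt": "h9"}
renames = find_exact_renames(deleted, added)
```

Files left over after this step (a/three.txt and b/four.txt here) are the ones that need the more expensive similarity comparison.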

Next, it checks how many unpaired entries there are, because it is going to (in effect) do num(L) times num(R) file comparisons to compute their similarities, and this is going to take a lot of time and space. It will even automatically downgrade a --find-copies-harder case that is "too hard". Then, for each possible L and R pairing, it computes a similarity index and a name score.

The similarity index code is in estimate_similarity in diffcore-rename.c. It relies on the function diffcore_count_changes in diffcore-delta.c, whose header comment says this (I'm copying it straight from the file since it's one of the core metrics):

 * Idea here is very simple.
 *
 * Almost all data we are interested in are text, but sometimes we have
 * to deal with binary data.  So we cut them into chunks delimited by
 * LF byte, or 64-byte sequence, whichever comes first, and hash them.
 *
 * For those chunks, if the source buffer has more instances of it
 * than the destination buffer, that means the difference are the
 * number of bytes not copied from source to destination.  If the
 * counts are the same, everything was copied from source to
 * destination.  If the destination has more, everything was copied,
 * and destination added more.
 *
 * We are doing an approximation so we do not really have to waste
 * memory by actually storing the sequence.  We just hash them into
 * somewhere around 2^16 hashbuckets and count the occurrences.

There's a secret bit here, though: the similarity index ignores a \r character if the file is considered "not binary" and the \r is immediately followed by \n.

The final similarity score is:

score = (int)(src_copied * MAX_SCORE / max_size);

where src_copied is the number of bytes in the hashed chunks (of 64 bytes or up to a newline) that occurred in the source and then occurred again in the destination, and max_size is the size, in bytes, of whichever blob is larger. (This byte count does not account for stripped '\r' characters. Those are merely removed from the 64-or-up-to-newline chunks being hashed.)
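To make the metric concrete, here is a simplified Python sketch of the chunk-and-count idea. It is not a port of diffcore-delta.c: the function names are hypothetical, the MAX_SCORE value is an assumed scale, every blob is treated as text by default, and chunks are keyed by their own bytes rather than hashed into roughly 2^16 buckets (so there are no hash collisions).

```python
from collections import Counter

MAX_SCORE = 60000  # assumed scale, used here only for illustration


def chunk_byte_counts(data: bytes, is_text: bool = True) -> Counter:
    """Cut data into chunks ending at LF or at 64 bytes, whichever comes
    first, and count the bytes seen per distinct chunk."""
    counts = Counter()
    chunk = bytearray()
    for i, byte in enumerate(data):
        # Ignore the CR of a CRLF pair when the blob is treated as text.
        if is_text and byte == 0x0D and data[i + 1:i + 2] == b"\n":
            continue
        chunk.append(byte)
        if len(chunk) == 64 or byte == 0x0A:
            counts[bytes(chunk)] += len(chunk)
            chunk.clear()
    if chunk:  # trailing partial chunk
        counts[bytes(chunk)] += len(chunk)
    return counts


def similarity_score(src: bytes, dst: bytes) -> int:
    """score = src_copied * MAX_SCORE / max_size, truncated to an integer."""
    src_counts = chunk_byte_counts(src)
    dst_counts = chunk_byte_counts(dst)
    # Per chunk, the bytes "copied" are the smaller of the two counts.
    src_copied = sum(min(n, dst_counts[key]) for key, n in src_counts.items())
    max_size = max(len(src), len(dst))  # raw sizes, CRs included
    if max_size == 0:
        return MAX_SCORE
    return src_copied * MAX_SCORE // max_size


old = b"alpha\nbeta\ngamma\ndelta\n"
new = b"alpha\nbeta\ngamma\nomega\n"
score_same = similarity_score(old, old)          # 60000
score_edit = similarity_score(old, new)          # 17 of 23 bytes copied: 44347
score_crlf = similarity_score(b"a\r\nb\r\n", b"a\nb\n")  # 4 of 6 bytes: 40000
```

The CRLF case shows the effect noted above: the carriage returns are dropped from the chunks, so both chunks match, but max_size still counts them, so the score stays below the maximum.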

The "name score" is really just 1 (same base name) or 0 (different base name), i.e., 1 if the L file is dir/oldbase and the R file is differentdir/oldbase, but 0 if the L file is dir/oldbase and the R file is anything/newbase. This is used to make Git favor newdir/oldbase over anything/newbase when those two files are equally similar.
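The tie-breaking role of the name score can be shown with a small Python sketch. The functions name_score and best_destination and the precomputed similarity values are hypothetical; the sketch only illustrates ordering by similarity first and base name second.

```python
import posixpath


def name_score(src_path: str, dst_path: str) -> int:
    """1 if both paths share a base name, else 0."""
    same = posixpath.basename(src_path) == posixpath.basename(dst_path)
    return 1 if same else 0


def best_destination(src_path, candidates):
    """Pick the best candidate for src_path.

    candidates: list of (dst_path, similarity) with precomputed
    similarity values (hypothetical input). Similarity decides first;
    the name score only breaks ties.
    """
    ranked = max(candidates,
                 key=lambda c: (c[1], name_score(src_path, c[0])))
    return ranked[0]


tie = best_destination("dir/oldbase",
                       [("anything/newbase", 45000),
                        ("newdir/oldbase", 45000)])
clear = best_destination("dir/oldbase",
                         [("anything/newbase", 50000),
                          ("newdir/oldbase", 45000)])
```

With equal similarity the same-base-name candidate wins; a higher similarity still beats a matching base name.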

The diff_tree_sha1 code calls (through a series of functions) ll_diff_tree_paths (both are in tree-diff.c). This is a complicated and extremely optimized bit of code (Git spends a lot of time here), so we'll just do a quick overview and ignore the complications (see footnote 2). This code looks partly at the full path names of each blob in each tree (these are the P1,...,Pn items in the comment at the top), and partly at the blob hashes for each of these names. For files that have the same name and the same contents, it does nothing (except in --find-copies-harder mode, in which case it queues all file names). For files that have the same name and different contents, or no L or R name, it calls (through function pointers stored in opt->pathchange, opt->change, and opt->add_remove) what eventually boils down to diff_change or diff_addremove, both in diff.c. These call diff_queue, which puts the file pair (one of which is a dummy if the file is new or removed) into the pairing queue.

Hence, the short version: if we're not using -C or --find-copies-harder, the pairing queue has unpaired files only when there is no original source file in L corresponding to a file in R, or no destination file in R corresponding to a source file in L. With -C, it also lists every source file, or every modified source file, so that they can be scanned for copies (the choice here being based on whether you used --find-copies-harder).

We already noted a shortcut in the diffcore-rename.c code: it skips over all R file names that are not the one file name we care about. There seem to be some similar hacks in ll_diff_tree_paths, although I am not sure whether they apply here. The code is also driven in a different way, as noted in footnote 2. When we diff parent P vs child C and find a rename left in our pairing queue, we switch out the name of the file we're using as the restriction in our git log -- <path>: we replace the new name in C with the path of the rename source in P. Then we just continue diffing as usual, so the next time we compare a P-and-C pair, we're looking for oldpath instead of newpath. If we detect oldpath as being renamed from reallyoldpath, we switch that name into place again, as before.
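The name-switching walk can be sketched in Python over a toy, linear history. Everything here is hypothetical: the follow function, the newest-first list of records, and the precomputed per-commit rename tables stand in for the diff that Git actually recomputes at each parent/child step.

```python
def follow(history, path):
    """Walk a linear history newest-first, following one path.

    history: list of (commit_id, changed_paths, renames), where
    changed_paths uses the names in the child C, and renames maps a
    new name in C to its old name in the parent P (hypothetical layout).
    Returns a list of (commit_id, path_name_at_that_commit).
    """
    result = []
    for commit_id, changed_paths, renames in history:
        if path not in changed_paths:
            continue  # this commit did not touch the followed file
        result.append((commit_id, path))
        if path in renames:
            # Switch the path restriction to the rename source, so that
            # older commits are searched under the old name.
            path = renames[path]
    return result


history = [
    ("c4", {"lib/util.py"}, {}),
    ("c3", {"lib/util.py"}, {"lib/util.py": "util.py"}),
    ("c2", {"README"}, {}),
    ("c1", {"util.py"}, {"util.py": "helpers.py"}),
    ("c0", {"helpers.py"}, {}),
]
followed = follow(history, "lib/util.py")
```

Nothing is stored between runs in this model: the rename tables are inputs to every walk, which mirrors the point that the followed name is worked out afresh at each step rather than read from a saved record.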

Note that all the -B, -C, and -M machinery applies in theory, but the shortcuts may keep some of it from working (it's not at all clear to me whether they do).

[1] When using --follow, Git uses the general diffcore code to run both pair-breaking and copy detection. The general code is called from the code that wants to do the simplification. See the function try_to_follow_renames in tree-diff.c, which calls diffcore_std in diff.c. This all eventually calls diff_resolve_rename_copy, which handles the pairing queues. Then try_to_follow_renames trims the result down to the one interesting file; this is later tested via diff_might_be_rename as called from diff_tree_sha1. I think this is all driven from log_tree_commit, called from either cmd_log_walk or log_show_early. This last appears to be an undocumented hack meant for use by some GUI(s).

[2] The tree matching in git diff actually accepts a single commit on the output (right-hand) side and a list of commits on the input (left-hand) side, for combined-diff purposes. This is how Git manages to show merge commits. It's a bit unclear how --follow works with merge commits, though. See find_paths_generic in combine-diff.c, which calls diff_tree_sha1 as well. Note that the log --follow hack happens as a result of calling diff_tree_sha1, and this combined-diff merge-handling code calls that function once per parent. If the followed name is going to be changed, though, it has already been changed by the time the code goes through the second parent. Perhaps this is a bug. What happens if that second parent decides the new name results in another, different rename? Logically, it should choose up to one new name per parent fork, working in topological order, and consider resolving them again somehow if and when the forks rejoin.

[3] The second value, m, in -Bn/m tells Git when not to run a real diff, and instead just describe the change in a non-renamed file as "delete all the original lines, replace them with all the new lines". That assumes that either the first -B value ended up not breaking the pairing, or the pairing was glued back together due to the -M value, or glued to a different source as a -C copy.

[4] See should_break for details. This also uses the diffcore-delta.c code, but in a different way, using the "added" count.
