What does git do when we run: git gc - git prune

Question

What is going on in the background when launching:

  • git gc
  • git prune

Output of git gc:

Counting objects: 945490, done. 
Delta compression using up to 4 threads.   
Compressing objects: 100% (334718/334718), done. 
Writing objects: 100%   (945490/945490), done. 
Total 945490 (delta 483105), reused 944529 (delta 482309) 
Checking connectivity: 948048, done.

Output of git prune:

Checking connectivity: 945490, done.

What is the difference between these two options?

Thanks

Recommended answer

TL;DR

git prune only removes loose, unreachable, stale objects (objects must have all three properties to get pruned). Unreachable packed objects remain in their pack files. Reachable loose objects remain reachable and loose. Objects that are unreachable, but are not yet stale, also remain untouched. The definition of stale is a little tricky (see details below).

git gc does more: it packs references, packs useful objects, expires reflog entries, prunes loose objects, prunes removed worktrees, and prunes / gc's old git rerere data.

I'm not sure what you mean by "in the background" above (background has a technical meaning in shells, and all of the activity here takes place in the shell's foreground, but I suspect you did not mean these terms).

What git gc does is to orchestrate a whole series of collection activities, including but not limited to git prune. The list below is the set of commands run by a foreground gc without --auto (omitting their arguments, which depend to some extent on git gc arguments):

  • git pack-refs: compact references (turn .git/refs/heads/... and .git/refs/tags/... entries into entries in .git/packed-refs, eliminating the individual files)
  • git reflog expire: expire old reflog entries
  • git repack: pack loose objects into packed object format
  • git prune: remove unwanted loose objects
  • git worktree prune: remove worktree data for added worktrees that the user has deleted
  • git rerere gc: remove old rerere records

There are a few more individual file activities git gc does on its own, but the above is the main sequence. Note that git prune happens after (1) expiring reflogs and (2) running git repack: this is because an expired reflog entry that is removed may cause an object to become unreferenced, and hence not get packed and then get pruned so that it is completely gone.
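To see why this ordering matters, here is a toy model in Python. It is only a sketch, not Git's code: the function `run_steps` and the sets `refs`, `reflog` and `loose` are invented for illustration, every reflog entry is assumed to be old enough to expire, and packing and the staleness check are ignored.

```python
def run_steps(steps, refs, reflog, loose):
    """Apply toy versions of two gc steps, in the given order.

    refs   -- object names found through branch or tag names
    reflog -- object names found only through (old) reflog entries
    loose  -- object names currently stored as loose objects
    Returns the loose objects that survive.
    """
    refs, reflog, loose = set(refs), set(reflog), set(loose)
    for step in steps:
        if step == "reflog expire":
            reflog.clear()          # assumption: every entry is old enough to expire
        elif step == "prune":
            loose &= refs | reflog  # drop loose objects that no name finds
        else:
            raise ValueError(f"unknown step: {step}")
    return loose


gc_order = run_steps(["reflog expire", "prune"], {"E"}, {"X"}, {"E", "X"})
reversed_order = run_steps(["prune", "reflog expire"], {"E"}, {"X"}, {"E", "X"})
print(sorted(gc_order))        # ['E']
print(sorted(reversed_order))  # ['E', 'X']
```

In the reversed order, object X is still found through its reflog entry when the prune step runs, so it survives until some later collection.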

Before going into any more detail, it's a good idea to define what an object is, in Git, and what it means for an object to be loose or packed. We also need to understand what it means for an object to be reachable.

Every object has a hash ID (one of those big ugly IDs you have seen in git log, for instance) that is that object's name, for retrieval purposes. Git stores all the objects in a key-value database where the name is the key, and the object itself is the value. Git's objects are therefore how Git stores files and commits, and in fact, there are four object types. A commit object holds an actual commit. A tree object holds sets of pairs [1]: a human-readable name like README or subdir along with another object's hash ID. That other object is a blob object if the name in the tree is a file name, or it is another tree object if the name is that of a subdirectory. The blob objects hold the actual file contents (but note that the name of the file is in the tree linking to the blob!). The last object type is annotated tag, used for annotated tags, which are not especially interesting here.

Once made, no object can ever be changed. This is because the object's name (its hash ID) is computed by looking at every single bit of the object's content. Change any one bit from a zero to a one or vice versa and the hash ID changes: you now have a different object, with a different name. This is how Git checks that no file has ever been tampered with: if the file contents were changed, the hash ID of the object would change. The object ID is stored in the tree entry, and if the tree object were changed, the tree's ID would change. The tree's ID is stored in the commit, and if the tree ID were changed, the commit's hash would change. So if you know that the commit's hash is a234b67... and the commit's content still hashes to a234b67..., nothing changed in the commit, and the tree ID is still valid. If the tree still hashes to its own name, its content is still valid, so the blob ID is correct; so as long as the blob content hashes to its own name, the blob is correct as well.
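As a small illustration of how the name depends on the content, the sketch below computes an object ID in Python. It assumes a repository that uses SHA-1 object names, where the ID is the SHA-1 of a short header (type, size, NUL byte) followed by the content; the helper name `git_object_id` is made up for this example.

```python
import hashlib


def git_object_id(kind: str, content: bytes) -> str:
    """Return the SHA-1 object ID for an object of this type and content."""
    header = f"{kind} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = git_object_id("blob", b"hello\n")
changed = git_object_id("blob", b"hellp\n")   # one character differs
print(original)
print(changed)
print(original == changed)  # False
```

The same content always produces the same ID, which is also why two commits that contain an identical file share a single blob object.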

Objects can be loose, which means they are stored as files. The name of the file is just the hash ID [2]. The contents of the loose object are zlib-deflated. Or, objects can be packed, which means many objects are stored in a single pack-file. In this case the contents are not just deflated, they're first delta-compressed. Git picks out a base object (often the latest version of some blob, i.e., file) and then finds additional objects that can be represented as a series of commands: take the base file, remove some text at this offset, add other text at another offset, and so on. The actual format of pack files is documented in Git's own pack-format technical documentation, if a bit lightly. Note that unlike most version control systems, the delta-compression occurs at a level below the stored-object abstraction: Git stores whole snapshots, then does delta-compression later, on the underlying objects. Git still accesses an object by its hash-ID name; it's just that reading that object involves reading the pack file, finding the object and its underlying delta bases, and reconstructing the complete object on the fly.
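The idea of rebuilding an object from a base plus a series of commands can be sketched as follows. This is a simplified model, not Git's binary delta encoding: the tuple layout and the name `apply_delta` are assumptions made for the example, loosely following the copy-from-base and insert-new-data instructions that pack deltas are built from.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild an object from a base object and a list of delta instructions.

    Each instruction is either ("copy", offset, length), which copies bytes
    from the base, or ("insert", data), which adds new literal bytes.
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            if offset < 0 or offset + length > len(base):
                raise ValueError("copy range falls outside the base object")
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown instruction: {op[0]!r}")
    return bytes(out)


base = b"line one\nline two\nline three\n"
delta = [
    ("copy", 0, 9),                      # "line one\n"
    ("insert", b"line 2 (edited)\n"),    # replaces the middle line
    ("copy", 18, 11),                    # "line three\n"
]
print(apply_delta(base, delta).decode())
```

Only the inserted bytes and a few small instructions need to be stored for the second version, which is where the space saving comes from.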

There's a general rule about pack files that states that any delta-compressed object within a pack file must have all its bases in the same pack file. This means that a pack file is self-contained: there's never a need to open multiple additional pack files to get an object out of a pack that has the object. (This particular rule can be deliberately violated, producing what Git calls a thin pack, but those are intended to be used only to send objects over a network connection to another Git that already has the base objects. The other Git must "fix" or "fatten" the thin pack to make a normal pack file, before leaving it behind for the rest of Git.)
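A minimal way to picture this rule, assuming a pack is modeled as a plain mapping from each object name to the name of its delta base (or None for an object stored whole). The names `missing_bases`, `normal_pack` and `thin_pack` are invented for this sketch.

```python
def missing_bases(pack: dict) -> set:
    """Return the delta bases that a pack refers to but does not contain."""
    return {base for base in pack.values()
            if base is not None and base not in pack}


normal_pack = {"v3": None, "v2": "v3", "v1": "v2"}
thin_pack = {"v2": "v3", "v1": "v2"}      # base v3 is deliberately left out

print(missing_bases(normal_pack))  # set()
print(missing_bases(thin_pack))    # {'v3'}
```

In this model, "fattening" a thin pack amounts to adding the missing base objects until `missing_bases` returns an empty set.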

Object reachability is a little bit tricky. Let's look first at commit reachability.

Note that when we have a commit object, that commit object itself contains several hash IDs. It has one hash ID for the tree that holds the snapshot that goes with that commit. It also has one or more hash IDs for parent commits, unless this particular commit is a root commit. A root commit is defined as a commit with no parents, so this is a bit circular: a commit has parents, unless it has no parents. It's clear enough though: given some commit, we can draw that commit as a node in a graph, with arrows coming out of the node, one per parent:

<--o
   |
   v

These parent arrows point to the commit's parent or parents. Given a series of single-parent commits we get a simple linear chain:

... <--o  <--o  <--o ...

One of these commits must be the start of the chain: that's the root commit. One of these must be the end, and that's the tip commit. All of the internal arrows point backwards (leftwards) so we can draw this without the arrow-heads, knowing that the root is at the left and the tip is at the right:

o--o--o--o--o

Now we can add a branch name like master. The name simply points to the tip commit:

o--o--o--o--o   <--master

None of the arrows embedded within a commit can ever change, because nothing in any object can ever change. The arrow in the branch name master, however, is actually just the hash ID of some commit, and this can change. Let's use letters to represent the commit hashes:

A--B--C--D--E   <-- master

The name master now just stores the commit hash of commit E. If we add a new commit to master, we do this by writing out a commit whose parent is E and whose tree is our snapshot, giving us an all-new hash, which we can call F. Commit F points back to E. We have Git write F's hash ID into master and now we have:

A--B--C--D--E--F   <-- master

We added one commit and changed one name, master. All the previous commits are reachable by starting at the name master. We read out the hash ID of F and read commit F. This has the hash ID of E, so we have reached commit E. We read E to get the hash ID of D, and thus reach D. We repeat until we read A, find that it has no parent, and are done.
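The walk just described can be written out as a short Python sketch. The dictionaries `parents` and `branches` are a hand-built stand-in for the commit graph above, not something read from a real repository.

```python
# Each commit maps to the list of its parents; A is the root commit.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"], "F": ["E"]}
branches = {"master": "F"}


def reachable_commits(tip, parents):
    """Follow parent links from a tip commit and list every commit reached."""
    seen, todo = [], [tip]
    while todo:
        commit = todo.pop()
        if commit in seen:
            continue
        seen.append(commit)
        todo.extend(parents[commit])   # a root commit adds nothing here
    return seen


print(reachable_commits(branches["master"], parents))
# ['F', 'E', 'D', 'C', 'B', 'A']
```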

If there are branches, that just means that we have commits found by another name whose parents are one of the commits also found by the name master:

A--B--C--D--E--F   <-- master
             \
              G--H   <-- develop

The name develop locates commit H; H finds G; and G refers back to E. So all of these commits are reachable.

Commits with more than one parent—i.e., merge commits—make all their parents reachable if the commit itself is reachable. So once you make a merge commit, you can (but do not have to) delete the branch name that identifies the commit that was merged-in: it's now reachable from the tip of the branch that you were on when you did the merge operation. That is:

...--o--o---o   <-- name
      \   /
       o--o   <-- delete-able

The commits on the bottom row here are reachable from name, through the merge, just as the commits on the top row were always reachable from name. Deleting the name delete-able leaves them still reachable. If the merge commit is not there, as in this case:

...--o--o   <-- name2
      \
       o--o   <-- not-delete-able

then deleting not-delete-able effectively abandons the two commits along the bottom row: they become unreachable, and hence eligible for garbage-collection.
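Both cases can be replayed with a small sketch. The commit labels p1, p2, q1, q2 and m are made up to stand for the unnamed o commits in the two drawings, and `unreachable` is a hypothetical helper, not a Git command.

```python
def unreachable(parents: dict, refs: dict) -> set:
    """Return the commits that cannot be found by starting from any name."""
    seen, todo = set(), list(refs.values())
    while todo:
        commit = todo.pop()
        if commit not in seen:
            seen.add(commit)
            todo.extend(parents[commit])
    return set(parents) - seen


# No merge: q1 and q2 are found only through the second branch name.
parents = {"p1": [], "p2": ["p1"], "q1": ["p1"], "q2": ["q1"]}
refs = {"name2": "p2", "not-delete-able": "q2"}
print(sorted(unreachable(parents, refs)))   # []
del refs["not-delete-able"]
print(sorted(unreachable(parents, refs)))   # ['q1', 'q2']

# With a merge commit m, one name is enough to find everything.
parents["m"] = ["p2", "q2"]
refs = {"name": "m"}
print(sorted(unreachable(parents, refs)))   # []
```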

This same reachability property applies to tree and blob objects. Commit G has a tree in it, for instance, and this tree has <name, ID> pairs:

A--B--C--D--E--F   <-- master
             \
              G--H   <-- develop
              |
         tree=d097...
            /   \
 README=9fa3... Makefile=0b41...

So from commit G, tree object d097... is reachable; from that tree, blob object 9fa3... is reachable, and so is blob object 0b41.... Commit H might have the very same README object, under the same name (though a different tree): that's fine, that just makes 9fa3 doubly reachable, which is not interesting to Git: Git only cares that it is reachable at all.
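The same walk works across all object types if each object simply lists the objects it refers to. In the sketch below the abbreviated IDs d097, 9fa3 and 0b41 come from the drawing above, while `tree-H`, `blob-new` and `orphan` are made-up names added for illustration, and the history before E is left out.

```python
# Each object maps to the objects it refers to (parents, trees, blobs).
links = {
    "E": [],                          # earlier history omitted in this sketch
    "G": ["E", "d097"],               # commit G: parent E, tree d097
    "H": ["G", "tree-H"],             # commit H: parent G, its own tree
    "d097": ["9fa3", "0b41"],         # README and Makefile blobs
    "tree-H": ["9fa3", "blob-new"],   # same README blob, one different file
    "9fa3": [], "0b41": [], "blob-new": [],
    "orphan": [],                     # no object or name refers to this blob
}


def reachable(start_points, links):
    """Return every object that can be found from the given entry points."""
    seen, todo = set(), list(start_points)
    while todo:
        name = todo.pop()
        if name not in seen:
            seen.add(name)
            todo.extend(links[name])
    return seen


found = reachable(["H"], links)   # the name develop points to H
print("9fa3" in found)     # True
print("orphan" in found)   # False
```

Blob 9fa3 is reached twice (through both trees), but the set records it once, which matches the point that only reachability itself matters.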

External references (branch and tag names, and other references found in Git repositories, including entries in Git's index and any references via linked added work-trees) provide the entry points into the object graph. From these entry points, any object is either reachable, meaning there are one or more names that can lead to it, or unreachable, meaning there are no names by which the object itself can be found. I've omitted annotated tags from this description, but they are generally found via tag names, and an annotated tag object has one object reference (of arbitrary object type) that it finds, making that one object reachable if the tag object itself is reachable.

Because references only refer to one object, but sometimes we do something with a branch name that we want to undo afterward, Git keeps a log of each value a reference had, and when. These reference logs or reflogs let us know what master had in it yesterday, or what was in develop last week. Eventually these reflog entries are old and stale and unlikely to be useful any more, and git reflog expire will discard them.
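As a rough model, a reflog can be pictured as a list of timestamped values, and expiring it as dropping the entries older than some cutoff. The function `expire_reflog`, the dates, and the 90-day window are all parameters chosen for this example rather than anything read from Git.

```python
from datetime import datetime, timedelta


def expire_reflog(entries, now, max_age):
    """Keep only reflog entries that are not older than max_age.

    `entries` is a list of (timestamp, object_name) pairs, oldest first.
    """
    cutoff = now - max_age
    return [(when, name) for when, name in entries if when >= cutoff]


now = datetime(2024, 1, 31)
master_log = [
    (datetime(2023, 9, 1), "C"),    # old entry
    (datetime(2024, 1, 24), "E"),   # what master held last week
    (datetime(2024, 1, 30), "F"),   # what master held yesterday
]
kept = expire_reflog(master_log, now, max_age=timedelta(days=90))
print([name for _, name in kept])  # ['E', 'F']
```

Once the entry for C is gone, nothing in the reflog keeps C reachable any more; whether C survives then depends on the other names that can still find it.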

What git repack does, at a high level, should now be reasonably clear: it turns a collection of many loose objects into a pack file full of all those objects. It can do more, though: it can include all objects from a previous pack. The previous pack becomes superfluous and can be removed afterward. It can also omit any unreachable objects from the pack, turning them instead into loose objects. When git gc runs git repack it does so with options that depend on the git gc options, so the exact semantics vary here, but the default for a foreground git gc is to use git repack -d -l, which has git repack delete redundant packs and run git prune-packed. The prune-packed program removes loose object files that also appear in pack files, so this removes the loose objects that went into the pack. The repack program passes the -l option on to git pack-objects (which is the actual workhorse that builds the pack file) where it means to omit objects that are borrowed from other repositories. (This last option is not important for most normal Git usage.)

In any case, it's git repack—or technically, git pack-objects—that prints the counting, compressing, and writing messages. When it is done you have a new pack file and the old pack file(s) are gone. The new pack file holds all the reachable objects, including the old reachable packed objects and the old reachable loose objects. If loose objects were ejected from one of the old (now torn-down and removed) pack files, they join the other loose (and unreachable) objects cluttering your repository. If they were destroyed during the tear-down, only the existing loose-and-unreachable objects remain.
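The outcome of a repack can be summarized with plain sets. This is a toy model under stated assumptions: objects are just names, `reachable` is given rather than computed, and the flag `eject_unreachable` is a made-up switch standing for the two possibilities described above (unreachable packed objects are either turned into loose objects or destroyed).

```python
def repack(loose, packed, reachable, eject_unreachable=True):
    """Return (new_pack, new_loose) for a toy repository.

    loose, packed -- sets of object names stored loose / in old packs
    reachable     -- set of object names that some reference can find
    """
    new_pack = (loose | packed) & reachable
    new_loose = loose - reachable              # loose and unreachable: left alone
    if eject_unreachable:
        new_loose |= packed - reachable        # thrown out of the old packs
    return new_pack, new_loose


pack, loose = repack(loose={"F", "x1"}, packed={"A", "B", "x2"},
                     reachable={"A", "B", "F"})
print(sorted(pack))    # ['A', 'B', 'F']
print(sorted(loose))   # ['x1', 'x2']
```

Either way, whatever remains loose and unreachable is what the following prune step gets to look at.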

It's now time for git prune: this finds loose, unreachable objects and removes them. However, it has a safety switch, --expire 2.weeks.ago: by default, as run by git gc, it does not remove such objects if they are not at least two weeks old. This means that any Git program that is in the process of creating new objects, that has not yet hooked them up, has a grace period. The new objects can be loose and unreachable for (by default) fourteen days before git prune will delete them. So a Git program that is busy creating objects has fourteen days during which it can complete the hooking-up of those objects into the graph. If it decides those objects are not worth hooking-up, it can just leave them; 14 days from that point, a future git prune will remove them.

If you run git prune manually, you must choose your --expire argument. The default without --expire is not 2.weeks.ago but instead just now.
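Putting the three conditions together (loose, unreachable, stale), a prune decision can be sketched like this. The function `prune` is a toy, and measuring an object's age by a last-modified timestamp is an assumption of the sketch; the default of a zero grace period mirrors the manual "now" behaviour, and two weeks mirrors how git gc runs it.

```python
from datetime import datetime, timedelta


def prune(loose, reachable, now, expire=timedelta(0)):
    """Return the loose objects that a toy prune would delete.

    loose     -- dict mapping object name to its last-modified time
    reachable -- set of object names some reference can find
    expire    -- minimum age before an unreachable loose object is removed
    """
    cutoff = now - expire
    return {name for name, mtime in loose.items()
            if name not in reachable and mtime <= cutoff}


now = datetime(2024, 1, 31)
loose = {
    "F": now - timedelta(days=30),     # reachable: never pruned
    "old": now - timedelta(days=30),   # unreachable and stale
    "new": now - timedelta(days=1),    # unreachable but recent
}
print(sorted(prune(loose, {"F"}, now, expire=timedelta(weeks=2))))  # ['old']
print(sorted(prune(loose, {"F"}, now)))                             # ['new', 'old']
```

The second call shows why a manual run without a grace period deserves care: an object that some other Git process created a moment ago, and has not yet connected, would be removed as well.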

[1] Tree objects actually hold triples: name, mode, hash. The mode is 100644 or 100755 for a blob object, 040000 for a sub-tree, 120000 for a symbolic link, and so on.

[2] For lookup speed on Linux, the hash is split after the first two characters: the hash name ab34ef56... becomes ab/34ef56... in the .git/objects directory. This keeps the size of each subdirectory within .git/objects small-ish, which tames the O(n^2) behavior of some directory operations. This ties in with git gc --auto, which repacks automatically when one object directory becomes sufficiently large. Git assumes that each subdirectory is about the same size, since the hashes should mostly be uniformly distributed, so it only needs to count one subdirectory.
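The path split and the "count one subdirectory" shortcut can be illustrated in a few lines. Both helper names are hypothetical, and the estimate simply assumes hashes spread evenly over the 256 possible two-hex-digit directory names; it is not a copy of the check git gc --auto performs.

```python
def loose_object_path(object_id: str) -> str:
    """Return the path of a loose object relative to the repository root."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def estimate_loose_count(objects_in_one_dir: int) -> int:
    """Estimate the total number of loose objects from one subdirectory."""
    return objects_in_one_dir * 256


print(loose_object_path("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))
# .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(estimate_loose_count(27))  # 6912
```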
