Does git upload a changed large file entirely to remote, or could it just upload the differences?

Problem description

Assume I have a big text file that changes in some parts periodically. I want to keep it synchronized with its remote version on a git server, preferably by uploading just its changed portions.

What's the default behavior of git? Does git upload the entire file each time it has been changed? Or does it have an option to upload just the differences?

What about non-text (binary) files?

Thanks

Answer

Does git upload [an] entire file each time it has been changed? Or has an option to upload just the differences?

The answer to this is actually "it depends".

The system you're describing—where we say "given existing file F, use the first part of F, then insert or delete this bit, then use another part of F" and so on—is called delta compression or delta encoding.

As Tim Biegeleisen answered, Git stores—logically, at least—a complete copy of each file with each commit (but with de-duplication, so if commits A and B both store the same copy of some file, they share a single stored copy). Git calls these stored copies objects. However, Git can do delta-compression of these objects within what Git calls pack files.
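
As a small illustration (the file name and contents here are just hypothetical), hashing two slightly different versions of a file shows that each version is, logically, its own complete blob object:

echo "original contents" > big.txt
git hash-object -w big.txt    # stores and prints the blob ID for version 1
echo "one more line" >> big.txt
git hash-object -w big.txt    # a different blob ID: logically a second, complete copy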

When one Git needs to send internal objects to another Git, to supply commits and their files, it can either:

  • send the individual objects, one at a time, or
  • send a pack file containing packed versions of the objects.

Git can only use delta-compression here if you use a Git protocol that sends a pack file. You can easily tell if you're using pack files because after git push you will see:

Counting objects: ... done
Compressing objects: ... done

This compressing phase occurs while building the pack file. There's no guarantee that when Git compressed the object, it specifically did use delta-compression against some version of the object that the other Git already has. But that's the goal and usually will be the case (except for a bug introduced in Git 2.26 and fixed in Git 2.27).

There is a general rule about pack files that git fetch and git push explicitly violate. To really understand how this all works, though, we should first describe this general rule.

Git has a program (and various internal functions that can be used more directly if/as needed) that builds a new pack file using just a set of raw objects, or some existing pack file(s), or both. In any case, the rule to be used here is that the new pack file should be completely self-contained. That is, any object inside pack file PF can only be delta-compressed against other objects that are also inside PF. So given a set of objects O1, O2, ..., On, the only delta-compression allowed is to compress some Oi against some Oj that appears in this same pack file.
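
If you want to see these in-pack deltas yourself, git verify-pack -v lists every object in a pack together with its delta depth and, for deltified objects, the base it was compressed against (the pack name is a placeholder and the output below is abridged and schematic):

git verify-pack -v .git/objects/pack/pack-<hash>.idx
# <object-id> blob <size> <size-in-pack> <offset>                <- a base object
# <object-id> blob <size> <size-in-pack> <offset> 1 <base-id>    <- depth-1 delta against <base-id>
# chain length = 1: <count> objects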

At least one object is always a base object, i.e., is not compressed at all. Let's call this object Ob1. Another object can be compressed against Ob1, producing a new compressed object Oc1. Then, another object can be compressed against either Ob1 directly, or against Oc1. Or, if the next object doesn't seem to compress well against Ob1 after all, it can be another base object, Ob2. Assuming the next object is compressed, let's call it Oc2. If it's compressed against Oc1, this is a delta chain: to decompress Oc2, Git will have to read Oc2, see that it links to Oc1, read Oc1, see that it links to Ob1, and retrieve Ob1. Then it can apply the Oc1 decompression rules to get the decompressed Oc1, and then apply the decompression rules for Oc2.

Since all these objects are in a single pack file, Git only needs to hold one file open. However, decompressing a very long chain can require a lot of jumping around in the file, to find the various objects and apply their deltas. The delta chain length is therefore limited. Git also tries to place the objects, physically within the pack file, in a way that makes reading the (single) pack file efficient, even with the implied jumping-around.

To obey all these rules, Git sometimes builds an entirely new pack file of every object in your repository, but only now and then. When building this new pack file, Git uses the previous pack file(s) as a guide that indicates which previously-packed objects compress well against which other previously-packed objects. It then only has to spend a lot of CPU time looking at new (since previous-pack-file) objects, to see which ones compress well and therefore which order it should use when building chains and so on. You can turn this off and build a pack file entirely from scratch, if some previous pack file was (by whatever chance) poorly constructed, and git gc --aggressive does this. You can also tune various sizes: see the options for git repack.
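
For example (the numbers are only illustrative, not recommendations), the delta search can be tuned per repack or via configuration, and --aggressive redoes the delta selection from scratch:

git repack -a -d -f --window=250 --depth=50   # -f recomputes deltas instead of reusing old ones
git config pack.window 250                    # default is 10: how many candidates to try per object
git config pack.depth 50                      # default is 50: the maximum delta-chain length
git gc --aggressive                           # roughly: repack everything with -f and a larger window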

For git fetch and git push, the pack building code turns off the "all objects must appear in the pack" option. Instead, the delta compressor is informed that it should assume that some set of objects exist. It can therefore use any of these objects as a base-or-chain object. The assumed-to-exist objects must be findable somewhere, somehow, of course. So when your Git talks to the other Git, they talk about commits, by their hash IDs.

If you are pushing, your Git is the one that has to build a pack file; if you're fetching, this works the same with the sides swapped. Let's assume you are pushing here.

Your Git tells theirs: I have commit X. Their Git tells yours: I too have X or I don't have X. If they do have X, your Git immediately knows two things:

  1. They also have all of X's ancestors.1
  2. Therefore they have all of X's tree and blob objects, plus all of its ancestors' tree and blob objects.
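
As an aside, you can watch this have / don't-have conversation by turning on packet tracing; a fetch shows the want/have exchange most directly (the hashes are placeholders and the output is heavily abridged):

GIT_TRACE_PACKET=1 git fetch origin 2>&1 | grep -E 'want|have|ACK'
# packet:  fetch> want <hash-they-advertised>
# packet:  fetch> have <hash-of-commit-X>
# packet:  fetch< ACK <hash-of-commit-X>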

Obviously, if they do have commit X, your Git need not send it. Your Git will only send descendants of X (commits Y and Z, perhaps). But by item 2 above, your Git can now build a pack file where your Git just assumes that their Git has every file that is in all the history leading up to, and including, commit X.

So this is where the "assume objects exist" code really kicks in: if you modified files F1 and F2 in commits Y and Z, but didn't touch anything else, they don't need any of the other files—and your new F1 and F2 files can be delta-compressed against any object in commit X or any of its ancestors.
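
Under the hood this thin-pack construction is roughly the job of git pack-objects; a rough hand-run equivalent (the hashes are placeholders) would be to exclude X and include Z, allowing deltas against objects reachable from X:

printf '%s\n' '^<hash-of-X>' '<hash-of-Z>' | git pack-objects --thin --revs --stdout > thin.pack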

The resulting pack file is called a thin pack. Having built the thin pack, your push (or their responder to your fetch) sends the thin pack across the network. They (for your push, or you for your fetch) must now "fix" this thin pack, using git index-pack --fix-thin. Fixing the thin pack is simply a matter of opening it up, finding all the delta chains and their object IDs, and finding those objects in the repository—remember, we've guaranteed that they are findable somewhere—and putting those objects into the pack, so that it's no longer thin.
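
The fix-up step on the receiving side looks roughly like this (a sketch; the pack file name is a placeholder, and in practice git receive-pack or git fetch runs it for you):

git index-pack --stdin --fix-thin < incoming-thin.pack   # append the missing base objects so the pack is self-contained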

The fattened packs are as big as they have to be, to hold all the objects they need to hold. But they're no bigger than that—they don't hold every object, only the ones they need to hold. So the old pack files remain.

After a while, a repository builds up a large number of pack files. At this point, Git decides that it's time to slim things down, re-packing multiple pack files into one single pack file that will hold everything. This allows it to delete redundant pack files entirely.2 The default for this is 50 pack files, so once you've accumulated 50 individual packs—typically via 50 fetch or push operations—git gc --auto will invoke the repack step and you'll drop back to one pack file.
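
The 50-pack threshold is the gc.autoPackLimit setting; you can check how many packs you currently have and let auto-gc decide whether to consolidate (the output shape is illustrative):

git count-objects -v          # the "packs:" line shows how many pack files exist
git config gc.autoPackLimit   # prints nothing if unset; the built-in default is 50
git gc --auto                 # repacks only once the configured limits are exceeded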

Note that this repacking has no effect on the thin packs: those depend only on the existence of the objects of interest, and this existence is implicit in the fact that a Git has a commit. Having a commit implies having all of its ancestors (though see footnote 1 again), so once we see that the other Git has commit X we're done with this part of the computation, and can build our thin pack accordingly.

1Shallow clones violate this "all ancestors" rule and complicate things, but we don't really need to go into the details here.

2In some situations it's desirable to keep an old pack; to do so, you just create a file with the pack's name ending in .keep. This is mostly for those setups where you're sharing a --reference repository.
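
For example (the pack hash is a placeholder), marking a pack as kept is just a matter of creating the matching .keep file:

touch .git/objects/pack/pack-<hash>.keep   # repacks will now leave this pack alone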
