Why can't Git handle large files and large repos?


Question



Dozens of questions and answers on SO and elsewhere emphasize that Git can't handle large files or large repos. A handful of workarounds are suggested such as git-fat and git-annex, but ideally Git would handle large files/repos natively.

If this limitation has been around for years, is there a reason it has not yet been removed? I assume that there's some technical or design challenge baked into Git that makes large file and large repo support extremely difficult.

Lots of related questions, but none seem to explain why this is such a big hurdle:

Solution

Basically, it comes down to tradeoffs.

One of your questions has an example from Linus himself:

[...] CVS, ie it really ends up being pretty much oriented to a "one file at a time" model.

Which is nice in that you can have a million files, and then only check out a few of them - you'll never even see the impact of the other 999,995 files.

Git fundamentally never really looks at less than the whole repo. Even if you limit things a bit (ie check out just a portion, or have the history go back just a bit), git ends up still always caring about the whole thing, and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one huge repository. I don't think that part is really fixable, although we can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to do about huge files. We suck at them, I know.

Just as you won't find a data structure with O(1) index access and insertion, you won't find a content tracker that does everything fantastically.

Git has deliberately chosen to be better at some things, to the detriment of others.


Disk usage

Since Git is a DVCS (distributed version control system), everyone has a copy of the entire repo (unless you use the relatively recent shallow clone).

This has some really nice advantages, which is why DVCSs like Git have become insanely popular.

However, a 4 TB repo on a central server is manageable with SVN or CVS, whereas with Git nobody will be thrilled to carry that around.
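
To make the disk-usage point concrete, here is a minimal sketch, in Python, of how you might compare a full clone against a shallow clone. It assumes `git` is on your PATH; `REPO_URL` is a placeholder, not a real repository.

```python
# Compare the on-disk size of a full clone vs. a shallow (--depth 1) clone.
# Assumes `git` is installed; REPO_URL is a placeholder to substitute.
import os
import subprocess
import tempfile

REPO_URL = "https://example.com/some/repo.git"  # placeholder

def dir_size(path):
    """Total size in bytes of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full")
    shallow = os.path.join(tmp, "shallow")

    # Full clone: every commit, tree, and blob in the repository's history.
    subprocess.run(["git", "clone", REPO_URL, full], check=True)
    # Shallow clone: only the objects reachable from the most recent commit.
    subprocess.run(["git", "clone", "--depth", "1", REPO_URL, shallow], check=True)

    print("full clone .git:   ", dir_size(os.path.join(full, ".git")), "bytes")
    print("shallow clone .git:", dir_size(os.path.join(shallow, ".git")), "bytes")
```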

Git has nifty mechanisms for minimizing the size of your repo by creating delta chains ("diffs") across files. Git isn't constrained by paths or commit order in creating these, and they really work quite well... kind of like gzipping the entire repo.

Git puts all these little diffs into packfiles. Delta chains and packfiles make retrieving objects take a little longer, but this is very effective at minimizing disk usage. (There are those tradeoffs again.)

That mechanism doesn't work as well for binary files, as they tend to differ quite a bit, even after a "small" change.
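
A rough way to see why, sketched in Python (the file contents and names are made up for illustration): a one-line change to text produces a tiny line-based diff, but the same change to a compressed, binary-like representation of the data leaves very little for a delta to reuse.

```python
# Illustration: a small edit yields a tiny delta for text, but the
# compressed ("binary") form of the same data diverges almost entirely.
import difflib
import os
import zlib

# Two versions of a text file; v2 changes one line near the top.
lines_v1 = [f"line {i}: value = {i * i}\n" for i in range(2000)]
lines_v2 = list(lines_v1)
lines_v2[5] = "line 5: value = CHANGED\n"

# Text delta: a line-based diff touches only a handful of lines.
text_delta = list(difflib.unified_diff(lines_v1, lines_v2))
print("diff lines for the text version:", len(text_delta))

# "Binary" versions: the same content stored compressed, the way many
# binary formats (images, archives, office documents) are internally.
blob_v1 = zlib.compress("".join(lines_v1).encode())
blob_v2 = zlib.compress("".join(lines_v2).encode())

shared_prefix = len(os.path.commonprefix([blob_v1, blob_v2]))
print("compressed size:", len(blob_v1), "bytes")
print("identical leading bytes:", shared_prefix)
# The compressed streams diverge early, so a delta against the previous
# version can reuse very little of it.
```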


History

When you check in a file, you have it forever and ever. Your grandchildren's grandchildren's grandchildren will download your cat gif every time they clone your repo.

This of course isn't unique to Git, but being a DVCS makes the consequences more significant.

And while it is possible to remove files, Git's content-based design (each object id is a SHA of its content) makes removing those files difficult, invasive, and destructive to history. In contrast, I can delete a crufty binary from an artifact repo, or an S3 bucket, without affecting the rest of my content.
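
The content-based design is easy to see in how a blob's object id is computed; the short Python sketch below mirrors what `git hash-object` does for a blob in a SHA-1 repository. Because tree ids hash the blob ids they contain, and commit ids hash their tree and parent ids, purging one file means rewriting every commit that comes after it.

```python
# A blob's id is the SHA-1 of a short header plus the file's raw bytes,
# which is what makes object ids depend entirely on content.
import hashlib

def git_blob_id(content: bytes) -> str:
    """Mirror `git hash-object --stdin` for a blob in a SHA-1 repository."""
    header = b"blob %d\0" % len(content)   # "blob <size>\0"
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))
# Trees hash the blob ids they reference, and commits hash their tree and
# parent ids, so removing one blob rewrites every descendant commit id.
```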


Difficulty

Working with really large files requires a lot of careful work, to make sure you minimize your operations, and never load the whole thing in memory. This is extremely difficult to do reliably when creating a program with as complex a feature set as git.
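
As a small illustration of that kind of care (a sketch, not Git's actual code; the function name is made up): hashing a file in fixed-size chunks keeps memory use bounded, whereas the naive one-liner pulls the whole file into RAM.

```python
# Hash a file without ever holding more than one chunk of it in memory.
import hashlib

def hash_file_streaming(path: str, chunk_size: int = 1024 * 1024) -> str:
    """SHA-1 of a file, reading it in chunk_size-byte pieces."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# The naive equivalent loads the entire file into memory at once, which is
# fine for source code but not for a multi-gigabyte asset:
#   hashlib.sha1(open(path, "rb").read()).hexdigest()
```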


Conclusion

Ultimately, developers who say "don't put large files in Git" are a bit like those who say "don't put large files in databases". They don't like it, but any alternative has disadvantages (Git integration in the one case, ACID compliance and foreign keys in the other). In reality, it usually works okay, especially if you have enough memory.

It just doesn't work as well as it does for what it was designed for.
