How does Git save space and stay fast at the same time?


Question

I just watched the first Git tutorial at http://blip.tv/play/Aeu2CAI.

How does Git store all the versions of all the files, and how can it still be more economical in space than Subversion, which saves only the latest version of the code?

I know this can be done with compression, but that would come at the cost of speed, and yet the tutorial also says that Git is much faster (though its biggest gain comes from the fact that most of its operations are offline).

So, my guess is that

  • Git compresses data extensively
  • It is still faster because uncompress + work is still faster than network_fetch + work

Am I correct? Even close?

Recommended answer

I assume you are asking how a git clone (full repository + checkout) can be smaller than checked-out sources in Subversion. Or did you mean something else?

This question has been answered in the comments.

On repository size: first, you should take into account that alongside the checkout (working version), Subversion stores a pristine copy (the latest version) in those .svn subdirectories. The pristine copy is stored uncompressed.

Second, Git uses the following techniques to make the repository smaller:

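  • Each version of a file is stored only once; if a file has only two different versions across 10 revisions (10 commits), Git stores just those two versions, not 10.
  • Objects (and deltas, see below) are stored compressed; text files used in programming compress very well (to around 60% of the original size, i.e. a 40% reduction in size from compression).
  • After repacking, objects are stored in deltified form, as a difference from some other version; in addition, Git tries to order delta chains so that the deltas consist mainly of deletions (in the usual case of growing files, this is recency order); IIRC deltas are compressed as well.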
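The "stored only once" point follows from objects being addressed by a hash of their content. Here is a minimal sketch of that idea; `git_blob_id` mirrors the way Git computes a blob ID (SHA-1 over a `blob <size>\0` header plus the content), while `ToyObjectStore` is a hypothetical in-memory stand-in, not Git's actual storage code.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory store keyed by content hash (not Git's real code)."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        oid = git_blob_id(content)
        # Identical content maps to the same key, so it is kept only once.
        self.objects.setdefault(oid, content)
        return oid


store = ToyObjectStore()
v1 = b"print('hello')\n"
v2 = b"print('hello, world')\n"
# Ten commits in which the file only ever has two distinct contents.
history = [v1, v1, v1, v2, v2, v2, v2, v1, v2, v2]
ids = [store.add(content) for content in history]
print(len(ids), "revisions,", len(store.objects), "stored objects")
# prints: 10 revisions, 2 stored objects
```

Compression and delta encoding are left out here; they are applied on top of this deduplication.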

On performance: first, any operation that involves the network is much slower than a local operation. For example, comparing the current state of the working area with some other version, or getting a log (a history), involves a network connection and network transfer in Subversion but is a local operation in Git, so it is of course much slower in Subversion than in Git. BTW, this is a difference between centralized version control systems (using a client-server workflow) and distributed version control systems (using a peer-to-peer workflow) in general, not only between Subversion and Git.

Second, if I understand it correctly, nowadays the limitation is not CPU but IO (disk access). It is therefore possible that the gain from having to read less data from disk because of compression (and from being able to mmap it into memory) outweighs the cost of decompressing the data.
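To get a feel for how much less data has to be read, here is a rough sketch using Python's `zlib` module (Git also uses zlib for its objects). The generated "source file" is an assumption chosen for illustration and is far more repetitive than real code, so the ratio it shows is not representative of a real repository.

```python
import zlib

# Repetitive source-like text stands in for a real file; actual ratios vary.
source = b"".join(
    b"def handler_%d(request):\n    return respond(request, code=%d)\n\n" % (i, i)
    for i in range(500)
)

compressed = zlib.compress(source)
restored = zlib.decompress(compressed)

assert restored == source  # compression is lossless
print(len(source), "bytes raw ->", len(compressed), "bytes to read from disk")
```

Whether decompression time is really cheaper than the saved disk reads depends on the hardware; this only shows the size side of the trade-off.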

Third, Git was designed with performance in mind (see e.g. the GitHistory page on the Git Wiki):

  • The index stores stat information for files, and Git uses it to decide whether files were modified without examining their contents (see e.g. the core.trustctime config variable).
  • The maximum delta depth is limited to pack.depth, which defaults to 50. Git has a delta cache to speed up access. There is a (generated) packfile index for fast access to objects in a packfile.
  • Git takes care not to touch files it doesn't have to. For example, when switching branches or rewinding to another version, Git updates only the files that changed. A consequence of this philosophy is that Git supports only very minimal keyword expansion (at least out of the box).
  • Git uses its own version of the LibXDiff library, nowadays also for diff and merge, instead of calling an external diff / external merge tool.
  • Git tries to minimize latency, which means good perceived performance. For example, it outputs the first page of "git log" as fast as possible, and you see it almost immediately, even if generating the full history would take more time; it doesn't wait for the full history to be generated before displaying it.
  • When fetching new changes, Git checks which objects you have in common with the server, and the server sends only the (compressed) differences in the form of a thin packfile. Admittedly, Subversion can (or perhaps by default does) also send only differences when updating.
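The first point in the list, deciding from stat information alone whether a file may have changed, can be sketched roughly as follows. `StatEntry`, `snapshot`, and `maybe_modified` are hypothetical names for illustration; Git's real index records more fields and handles edge cases (such as files modified within the same timestamp tick) that this sketch ignores.

```python
import os
import tempfile
from typing import NamedTuple


class StatEntry(NamedTuple):
    """Hypothetical cache record; Git's real index stores more fields."""
    size: int
    mtime_ns: int


def snapshot(path: str) -> StatEntry:
    st = os.stat(path)
    return StatEntry(st.st_size, st.st_mtime_ns)


def maybe_modified(path: str, cached: StatEntry) -> bool:
    """True if stat data differs, meaning the content needs a real check."""
    return snapshot(path) != cached


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "example.txt")
    with open(path, "w") as f:
        f.write("first version\n")
    cached = snapshot(path)

    # Nothing changed: the answer comes from stat() without reading the file.
    unchanged = maybe_modified(path, cached)

    with open(path, "w") as f:
        f.write("second, longer version\n")
    # The size now differs, so stat data alone is enough to flag the file.
    changed = maybe_modified(path, cached)

print(unchanged, changed)
# prints: False True
```

The design choice is to make the common case (nothing changed) cost one `stat()` call per file rather than a full read and hash.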

I am not a Git hacker, and I have probably missed some of the techniques and tricks that Git uses for better performance. Note, however, that Git relies heavily on POSIX features (like memory-mapped files) for this, so the gain might not be as large on MS Windows.
