Git fetch for many files is slow against a high-latency disk

Problem description

What I'm interested in here is some insight into git's internals -

If I have a repo hosted remotely on Bitbucket with many files (say ~25000, they're all around 2K in size), why is the first fetch so slow when targeting a high-latency disk?

I would expect operations like the first checkout to be slow, due to the need to write lots of files, but the fetch should only be receiving a handful of metadata and pack files and writing those to disk. The disk is high-latency but throughput is fine, so the performance of writing a small number of large files is generally fine.
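The asymmetry the question expects can be observed directly. Below is a minimal local sketch (it assumes `git` and a POSIX shell are available; the repo layout, file count, and paths are illustrative, not from the original question): the fetch lands as a single pack plus a ref, while the checkout writes every working-tree file individually, which is the part a high-latency disk punishes.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
cd "$tmp/origin"
# build a toy "remote" with many small files
for i in $(seq 1 200); do printf 'file %s\n' "$i" > "file$i.txt"; done
git add .
git -c user.name=t -c user.email=t@example.com commit -qm 'many small files'
git init -q "$tmp/clone"
cd "$tmp/clone"
git fetch -q "$tmp/origin" HEAD        # one pack write, not 200 file writes
pack_entries=$(ls .git/objects/pack | wc -l)
git checkout -q FETCH_HEAD             # now every file is written individually
tree_files=$(ls | wc -l)
echo "pack dir entries: $pack_entries, working-tree files: $tree_files"
```

Timing the two commands (e.g. with `GIT_TRACE_PERFORMANCE=1`) on the slow disk should show the checkout, not the fetch, dominated by per-file latency.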

Answer

The fetch should only be receiving a handful of metadata and pack files and writing those to disk.

Still, Git 2.20 (Q4 2018) will improve fetching speed.

That is because, when creating a thin pack, which allows objects to be made into a delta against another object that is not in the resulting pack but is known to be present on the receiving end, the code learned to take advantage of the reachability bitmap; this allows the server to send a delta against a base beyond the "boundary" commit.

See commit 6a1e32d, commit 30cdc33 (21 Aug 2018), and commit 198b349, commit 22bec79, commit 5a924a6, commit 968e77a (17 Aug 2018) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 3ebdef2, 17 Sep 2018)

pack-objects: reuse on-disk deltas for thin "have" objects

When we serve a fetch, we pass the "wants" and "haves" from the fetch negotiation to pack-objects. That tells us not only which objects we need to send, but we also use the boundary commits as "preferred bases": their trees and blobs are candidates for delta bases, both for reusing on-disk deltas and for finding new ones.
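That want/have exchange can be watched on the wire. The sketch below (assuming `git` is installed; the repo names are made up) uses `GIT_TRACE_PACKET` to dump the protocol lines of a fetch, where the client's "have" lines are exactly the boundary information pack-objects receives:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/origin"
git -C "$tmp/origin" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m one
git clone -q "$tmp/origin" "$tmp/clone"
git -C "$tmp/origin" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m two
# the clone already has commit "one", so negotiation advertises it as a "have"
haves=$(GIT_TRACE_PACKET=1 git -C "$tmp/clone" fetch -q origin 2>&1 \
        | grep -c 'have ' || true)
echo "have lines in the trace: $haves"
```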

However, this misses some opportunities. Modulo some special cases like shallow or partial clones, we know that every object reachable from the "haves" could be a preferred base.
We don't use all of them for two reasons:

  1. It's expensive to walk the whole history and enumerate all of the objects on the other side.
  2. The delta search is expensive, so we want to keep the number of candidate bases sane. The boundary commits are the most likely to work.

When we have reachability bitmaps, though, reason 1 no longer applies.
We can efficiently compute the set of reachable objects on the other side (and in fact already did so as part of the bitmap set-difference to get the list of interesting objects). And using this set conveniently covers the shallow and partial cases, since we have to disable the use of bitmaps for those anyway.
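Reachability bitmaps are an opt-in, server-side on-disk index. As a quick sketch of how a server operator would produce one (assuming `git` is available; the repo here is a throwaway illustration), `git repack -adb` writes a `.bitmap` file alongside the single resulting pack:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git repack -adbq                 # -b: write a reachability bitmap
bitmaps=$(ls .git/objects/pack/*.bitmap | wc -l)
echo "bitmap files: $bitmaps"
```

Setting `repack.writeBitmaps=true` achieves the same thing persistently.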

The second reason argues against using these bases in the search for new deltas.

But there's one case where we can use this information for free: when we have an existing on-disk delta that we're considering reusing, we can do so if we know the other side has the base object. This in fact saves time during the delta search, because it's one less delta we have to compute.

And that's exactly what this patch does: when we're considering whether to reuse an on-disk delta, if bitmaps tell us the other side has the object (and we're making a thin-pack), then we reuse it.
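The on-disk deltas being reused here are visible with `git verify-pack -v`, whose per-object lines show delta depth and base, and whose summary reports the delta chain lengths. A small sketch (assuming `git`; the file contents are illustrative, and whether pack-objects deltifies a given pair is heuristic, though near-identical blobs like these normally are):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
seq 1 500 > data.txt
git add data.txt
git -c user.name=t -c user.email=t@example.com commit -qm v1
seq 1 501 > data.txt             # near-identical blob: a good delta candidate
git -c user.name=t -c user.email=t@example.com commit -aqm v2
git repack -adq                  # store the history as one pack with deltas
chains=$(git verify-pack -v .git/objects/pack/*.idx \
         | grep -c 'chain length' || true)
echo "delta chain summary lines: $chains"
```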

Here are the results on p5311 using linux.git, which simulates a client fetching after N days since their last fetch:

 Test                         origin              HEAD
 --------------------------------------------------------------------------
 5311.3: server   (1 days)    0.27(0.27+0.04)     0.12(0.09+0.03) -55.6%
 5311.4: size     (1 days)               0.9M              237.0K -73.7%
 5311.5: client   (1 days)    0.04(0.05+0.00)     0.10(0.10+0.00) +150.0%
 5311.7: server   (2 days)    0.34(0.42+0.04)     0.13(0.10+0.03) -61.8%
 5311.8: size     (2 days)               1.5M              347.7K -76.5%
 5311.9: client   (2 days)    0.07(0.08+0.00)     0.16(0.15+0.01) +128.6%
 5311.11: server   (4 days)   0.56(0.77+0.08)     0.13(0.10+0.02) -76.8%
 5311.12: size     (4 days)              2.8M              566.6K -79.8%
 5311.13: client   (4 days)   0.13(0.15+0.00)     0.34(0.31+0.02) +161.5%
 5311.15: server   (8 days)   0.97(1.39+0.11)     0.30(0.25+0.05) -69.1%
 5311.16: size     (8 days)              4.3M                1.0M -76.0%
 5311.17: client   (8 days)   0.20(0.22+0.01)     0.53(0.52+0.01) +165.0%
 5311.19: server  (16 days)   1.52(2.51+0.12)     0.30(0.26+0.03) -80.3%
 5311.20: size    (16 days)              8.0M                2.0M -74.5%
 5311.21: client  (16 days)   0.40(0.47+0.03)     1.01(0.98+0.04) +152.5%
 5311.23: server  (32 days)   2.40(4.44+0.20)     0.31(0.26+0.04) -87.1%
 5311.24: size    (32 days)             14.1M                4.1M -70.9%
 5311.25: client  (32 days)   0.70(0.90+0.03)     1.81(1.75+0.06) +158.6%
 5311.27: server  (64 days)   11.76(26.57+0.29)   0.55(0.50+0.08) -95.3%
 5311.28: size    (64 days)             89.4M               47.4M -47.0%
 5311.29: client  (64 days)   5.71(9.31+0.27)     15.20(15.20+0.32) +166.2%
 5311.31: server (128 days)   16.15(36.87+0.40)   0.91(0.82+0.14) -94.4%
 5311.32: size   (128 days)            134.8M              100.4M -25.5%
 5311.33: client (128 days)   9.42(16.86+0.49)    25.34(25.80+0.46) +169.0%

In all cases we save CPU time on the server (sometimes significant) and the resulting pack is smaller.
We do spend more CPU time on the client side, because it has to reconstruct more deltas.

But that's the right tradeoff to make, since clients tend to outnumber servers.
It just means the thin pack mechanism is doing its job.

From the user's perspective, the end-to-end time of the operation will generally be faster. E.g., in the 128-day case, we saved 15s on the server at a cost of 16s on the client.
Since the resulting pack is 34MB smaller, this is a net win if the network speed is less than 270Mbit/s. And that's actually the worst case.
The 64-day case saves just over 11s at a cost of just under 11s. So it's a slight win at any network speed, and the 40MB saved is pure bonus.
That trend continues for the smaller fetches.

With Git 2.22 (Q2 2019), another option will help on the repacking side: the pathname hash-cache is now created by default, to avoid making crappy deltas when repacking.

See commit 36eba03 (14 Mar 2019) by Eric Wong (ele828).
See commit d431660, commit 90ca149 (15 Mar 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 2bfb182, 13 May 2019)

pack-objects: default to writing bitmap hash-cache

Enabling pack.writebitmaphashcache should always be a performance win.
It costs only 4 bytes per object on disk, and the timings in ae4f07f (pack-bitmap: implement optional name_hash cache, 2013-12-21, Git v2.0.0-rc0) show it improving fetch and partial-bitmap clone times by 40-50%.
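For servers on pre-2.22 Git, this is the knob to flip manually. A sketch of the opt-in (assuming `git`; the repo is a throwaway illustration, and on 2.22+ the setting is already the default):

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git -c user.name=t -c user.email=t@example.com commit -q --allow-empty -m one
git config pack.writeBitmapHashCache true   # the default since Git 2.22
git repack -adbq                            # bitmap now carries the hash-cache
setting=$(git config pack.writeBitmapHashCache)
echo "pack.writeBitmapHashCache = $setting"
```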

The only reason we didn't enable it by default at the time is that early versions of JGit's bitmap reader complained about the presence of optional header bits it didn't understand.
But that was changed in JGit's d2fa3987a (Use bitcheck to check for presence of OPT_FULL option, 2013-10-30), which made it into JGit v3.5.0 in late 2014.

So let's turn this option on by default.
It's backwards-compatible with all versions of Git, and if you are also using JGit on the same repository, you'd only run into problems using a version that's almost 5 years old.

We'll drop the manual setting from all of our test scripts, including perf tests. This isn't strictly necessary, but it has two advantages:

  1. If the hash-cache ever stops being enabled by default, our perf regression tests will notice.

  2. We can use the modified perf tests to show off the behavior of an otherwise unconfigured repo, as shown below.

These are the results of a few of a perf tests against linux.git that showed interesting results.
You can see the expected speedup in 5310.4, which was noted in ae4f07f (Dec. 2013, Git v2.0.0-rc0).
Curiously, 5310.8 did not improve (and actually got slower), despite seeing the opposite in ae4f07f. I don't have an explanation for that.

The tests from p5311 did not exist back then, but do show improvements (a smaller pack due to better deltas, which we found in less time).

  Test                                    HEAD^                HEAD
  -------------------------------------------------------------------------------------
  5310.4: simulated fetch                 7.39(22.70+0.25)     5.64(11.43+0.22) -23.7%
  5310.8: clone (partial bitmap)          18.45(24.83+1.19)    19.94(28.40+1.36) +8.1%
  5311.31: server (128 days)              0.41(1.13+0.05)      0.34(0.72+0.02) -17.1%
  5311.32: size   (128 days)                         7.4M                 7.0M -4.8%
  5311.33: client (128 days)              1.33(1.49+0.06)      1.29(1.37+0.12) -3.0%

Git 2.23 (Q3 2019) makes sure generation of pack bitmaps are now disabled when .keep files exist, as these are mutually exclusive features.

See commit 7328482 (29 Jun 2019) by Eric Wong (ele828).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit d60dc1a, 19 Jul 2019)

This fixes 36eba03 ("repack: enable bitmaps by default on bare repos", March 2019, Git v2.22.0-rc0)

repack: disable bitmaps-by-default if .keep files exist

Bitmaps aren't useful with multiple packs, and users with .keep files ended up with redundant packs when bitmaps got enabled by default in bare repos.

So detect when .keep files exist and stop enabling bitmaps by default in that case.
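A `.keep` file sits next to a pack (`pack-<hash>.keep`) and marks it as exempt from repacking. The sketch below (assuming `git` 2.23 or later; the repo and pack names are illustrative and vary per repository) shows that with a kept pack present, a bare repo's `git repack -ad` no longer produces a bitmap:

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/work"
git -C "$tmp/work" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m one
git clone -q --bare "$tmp/work" "$tmp/server.git"
cd "$tmp/server.git"
git -c repack.writeBitmaps=false repack -adq   # start from a plain pack
pack=$(ls objects/pack/*.pack)
touch "${pack%.pack}.keep"                     # mark the pack as kept
git repack -adq                                # bitmaps stay off: a kept pack exists
bitmaps=$(ls objects/pack/*.bitmap 2>/dev/null | wc -l)
echo "bitmap files after repack: $bitmaps"
```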

Wasteful (but otherwise harmless) race conditions with .keep files documented by Jeff King still apply and there's a chance we'd still end up with redundant data on the FS, as discussed here.

v2: avoid subshell in test case, be multi-index aware

However, the same Git 2.23 (Q3 2019) squelches unneeded and misleading warnings from "repack" when the command attempts to generate pack bitmaps without being explicitly asked to by the user.

See commit 7ff024e, commit 2557501, commit cc2649a (31 Jul 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 51cf315, 01 Aug 2019)

repack: simplify handling of auto-bitmaps and .keep files

Commit 7328482 (repack: disable bitmaps-by-default if .keep files exist, 2019-06-29, Git v2.23.0-rc0) taught repack to prefer disabling bitmaps to duplicating objects (unless bitmaps were asked for explicitly).

But there's an easier way to do this: if we keep passing the --honor-pack-keep flag to pack-objects when auto-enabling bitmaps, then pack-objects already makes the same decision (it will disable bitmaps rather than duplicate).
Better still, pack-objects can actually decide to do so based not just on the presence of a .keep file, but on whether that .keep file actually impacts the new pack we're making (so if we're racing with a push or fetch, for example, their temporary .keep file will not block us from generating bitmaps if they haven't yet updated their refs).

And because repack uses the --write-bitmap-index-quiet flag, we don't have to worry about pack-objects generating confusing warnings when it does see a .keep file.
We can confirm this by tweaking the .keep test to check repack's stderr.
