Is repacking a repository useful for large binaries?

Question

I'm trying to convert a large history from Perforce to Git, and one folder (now a git branch) contains a significant number of large binary files. My problem is that I'm running out of memory while running git gc --aggressive.

My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries. Compressing them another 20% would be great. 0.2% isn't worth my effort. If not, I'll have them skipped over as suggested here.

For background, I successfully used git p4 to create the repository in a state I'm happy with, but this uses git fast-import behind the scenes so I want to optimize the repository before making it official, and indeed making any commits automatically triggered a slow gc --auto. It's currently ~35GB in a bare state.

The binaries in question seem to be, conceptually, the vendor firmware used in embedded devices. I think there are approximately 25 in the 400-700MB range and maybe a couple hundred more in the 20-50MB range. They might be disk images, but I'm unsure of that. There's a variety of versions and file types over time, and I see .zip, tgz, and .simg files frequently. As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

These binaries are contained in one (old) branch that will be used exceedingly rarely (to the point that questioning version control at all is valid, but that's out of scope). Certainly the performance of that branch does not need to be great. But I'd like the rest of the repository to be reasonable.

Other suggestions for optimal packing or memory management are welcome. I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack. But the primary question is whether the repacking of the binaries themselves is doing anything meaningful.

Answer


My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries.

That depends on their contents. For the files you've outlined specifically:


I see .zip, tgz, and .simg files frequently.

Zipfiles and tgz (gzipped tar archive) files are already compressed and have terrible (i.e., high) Shannon entropy values—terrible for Git that is—and will not compress against each other. The .simg files are probably (I have to guess here) Singularity disk image files; whether and how they are compressed, I don't know, but I would assume they are. (An easy test is to feed one to a compressor, e.g., gzip, and see if it shrinks.)
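As a concrete sketch of that test, assuming a POSIX shell and using firmware.simg as a stand-in name for one of the large binaries (not a file from the question):

    # Compare the original size against the gzip-compressed size.
    ls -l firmware.simg
    gzip -c firmware.simg | wc -c
    # If the second number is barely smaller than the first,
    # the file is effectively already compressed.

If the file barely shrinks, Git's own zlib compression won't do any better.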


As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

Precisely. Storing them uncompressed in Git would thus, paradoxically, result in far greater compression in the end. (But the packing could require significant amounts of memory.)
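As a rough way to measure what packing actually achieved, assuming you run this inside the bare repository, you can compare each object's logical size with its deltified-and-compressed size on disk:

    # List the 20 largest objects with logical size and on-disk size.
    git cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype) %(objectsize) %(objectsize:disk)' |
      sort -k3 -n -r | head -20

If the on-disk size is close to the logical size for the firmware blobs, repacking bought essentially nothing for them.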


If [this is likely futile], I'll have them skipped over as suggested here.

That would be my first impulse here. :-)


I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack.

The various limits are confusing (and profuse). It's also important to realize that they don't get copied on clone, since they are in .git/config, which is not a committed file, so new clones won't pick them up. The .gitattributes file is copied on clone, and new clones will continue to avoid packing unpackable files, so it's the better approach here.
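A minimal .gitattributes sketch along these lines, assuming the extensions from the question are the ones worth excluding (the -delta attribute tells Git not to attempt delta compression on matching paths; the blobs are still zlib-deflated when packed):

    *.zip  -delta
    *.tgz  -delta
    *.simg -delta

Because this file is committed, every clone inherits the behavior automatically.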

(If you care to dive into the details, you will find some in the Git technical documentation. This does not discuss precisely what the window sizes are about, but it has to do with how much memory Git uses to memory-map object data when selecting objects that might compress well against each other. There are two: one for each individual mmap on one pack file, and one for the total aggregate mmap on all pack files. Not mentioned on your link: core.deltaBaseCacheLimit, which is how much memory will be used to hold delta bases—but to understand this you need to grok delta compression and delta chains,¹ and read that same technical documentation. Note that Git will default to not attempting to pack any file object whose size exceeds core.bigFileThreshold. The various pack.* controls are a bit more complex: the packing is done multi-threaded to take advantage of all your CPUs if possible, and each thread can use a lot of memory. Limiting the number of threads limits total memory use: if one thread is going to use 256 MB, 8 threads are likely to use 8*256 = 2048 MB or 2 GB. The bitmaps mainly speed up fetching from busy servers.)
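For illustration, here is one way to apply those limits before repacking; the config keys are the ones named above, but the values are untuned guesses for a memory-constrained run, not recommendations:

    git config pack.threads 2               # fewer threads, less total memory
    git config pack.windowMemory 256m       # per-thread cap on delta-window memory
    git config core.deltaBaseCacheLimit 512m
    git config core.bigFileThreshold 50m    # blobs above this are never deltified
    git repack -a -d                        # repack under the new limits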

¹ They're not that complicated: a delta chain occurs when one object says "take object XYZ and apply these changes", but object XYZ itself says "take object PreXYZ and apply these changes". Object PreXYZ can also take another object, and so on. The delta base is the object at the bottom of this list.
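If you're curious what the delta chains in an existing pack look like, git verify-pack prints a chain-length histogram at the end of its verbose output; the path below assumes a bare repository, where packs live directly under objects/pack/:

    git verify-pack -v objects/pack/pack-*.idx | tail -15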
