Is there a difference between how Git stores text and binary files?


Question

According to https://opensource.com/life/16/8/how-manage-binary-blobs-git-part-7:

One thing everyone seems to agree on is Git is not great for big binary blobs. Keep in mind that a binary blob is different from a large text file; you can use Git on large text files without a problem, but Git can't do much with an impervious binary file except treat it as one big solid black box and commit it as-is.

Say you have a complex 3D model for the exciting new first person puzzle game you're making, and you save it in a binary format, resulting in a 1 gigabyte file. You git commit it once, adding a gigabyte to your repository's history. Later, you give the model a different hair style and commit your update; Git can't tell the hair apart from the head or the rest of the model, so you've just committed another gigabyte. Then you change the model's eye color and commit that small change: another gigabyte. That is three gigabytes for one model with a few minor changes made on a whim. Scale that across all the assets in a game, and you have a serious problem.

It was my understanding that there is no difference between text and binary files and Git stores all files of each commit in their entirety (creating a checksummed blob), with unchanged files simply pointing to an already existing blob. How all those blobs are stored and compressed is another question, that I do not know the details of, but I would have assumed that if the various 1GB files in the quote are more or less the same, a good compression algorithm would figure this out and may be able to store all of them in even less than 1GB total, if they are repetitive. This reasoning should apply to binary as well as to text files.
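That content-addressing can be sketched directly. Git names a blob by the SHA-1 of a short header plus the file's raw bytes (in the classic SHA-1 object format), so identical content always maps to the same object ID and is stored only once — which is exactly why an unchanged file in a new commit costs nothing:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the raw bytes.
    # Identical content always yields the same object ID, so an
    # unchanged file in a new commit simply points at the existing blob.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

This matches what `git hash-object` reports for the same bytes, and it applies identically to text and binary content — the hashing step makes no distinction.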

In contrast, the quote goes on to say:

Contrast that to a text file like the .obj format. One commit stores everything, just as with the other model, but an .obj file is a series of lines of plain text describing the vertices of a model. If you modify the model and save it back out to .obj, Git can read the two files line by line, create a diff of the changes, and process a fairly small commit. The more refined the model becomes, the smaller the commits get, and it's a standard Git use case. It is a big file, but it uses a kind of overlay or sparse storage method to build a complete picture of the current state of your data.
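The line-oriented diff described there can be illustrated with Python's `difflib` (standing in for Git's actual xdiff machinery) on a tiny `.obj`-style model where only one vertex moved — the diff captures just the changed line, not the whole file:

```python
import difflib

# Two versions of a tiny .obj-style model: only one vertex moved.
old = ["v 0.0 0.0 0.0", "v 1.0 0.0 0.0", "v 0.0 1.0 0.0"]
new = ["v 0.0 0.0 0.0", "v 1.0 0.0 0.5", "v 0.0 1.0 0.0"]

for line in difflib.unified_diff(old, new, lineterm=""):
    print(line)
# The hunk contains one '-' line and one '+' line; the rest is context.
```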

Is my understanding correct? Is the quote incorrect?

Answer

You're right in that text and binary files are really just blob objects. If that were all there was to the story, things would be simpler, but it isn't, so they aren't. :-)

(You can also instruct Git to perform various filtering operations on input files. Here again, there's no difference between text and binary files in terms of what the filters do, but there is a difference in terms of when filters are applied by default: If you use the automatic mode, Git will filter a file that Git thinks is text, and not-filter a file that Git thinks is binary. But that only matters if you use the automatic detection and CRLF / LF-only line ending conversions.)
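For example, that default text/binary decision can be overridden per path in `.gitattributes` using the standard `text`, `text=auto`, and `binary` attributes (the file patterns here are just illustrative):

```text
# Let Git auto-detect text vs binary for everything not listed below
*       text=auto

# Always treat these as text (eligible for CRLF/LF conversion)
*.obj   text
*.md    text

# Never apply line-ending conversion (binary = -diff -merge -text)
*.png   binary
*.blend binary
```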

I would have assumed that if the various 1GB files in the quote are more or less the same, a good compression algorithm would figure this out and may be able to store all of them in even less than 1GB total, if they are repetitive ...

Maybe, and maybe not. Git has two separate compression algorithms. As Noufal Ibrahim said, one of these two—delta compression—is applied only in what Git calls pack files. The other one is zlib, which is applied to everything.

Zlib is a general compression algorithm that relies on a particular modeling process (see Is there an algorithm for "perfect" compression? for background). It tends to perform pretty well on plain text, and not so well on some binaries. It also tends to make already-compressed files bigger, so if your 1 GB inputs are already compressed, they are likely to be (marginally) larger after zlib compression. But all of these are generalities; to find out how zlib behaves on your specific data, the trick is to run it on your specific data.
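Both tendencies are easy to check with Python's `zlib` module (a binding to the same library Git uses). The inputs below are synthetic, chosen only to stand in for "repetitive text" and "already-compressed bytes" — seeded random data is effectively incompressible, like the output of a real compressor:

```python
import random
import zlib

random.seed(0)

# Repetitive plain text, like a large .obj file full of vertex lines.
text = b"v 1.0 2.0 3.0\n" * 10_000

# Incompressible bytes, standing in for already-compressed data.
noise = bytes(random.randrange(256) for _ in range(140_000))

packed_text = zlib.compress(text)
packed_noise = zlib.compress(noise)

print(len(text), len(packed_text))    # the text shrinks dramatically
print(len(noise), len(packed_noise))  # the noise does not shrink, and may grow slightly
```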

The delta encoding that Git uses happens "before" zlib compression, and does work with binary data. Essentially, it finds long binary sequences of bytes that match in an "earlier" and "later" object (with "earlier" and "later" being rather loosely defined here, but Git imposes a particular walk and compare order on the objects for reasons discussed here) and if possible, replaces some long sequence of N bytes with "referring to earlier object, grab N bytes from offset O".
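A toy version of this copy-based encoding is sketched below. It assumes nothing about Git's actual pack wire format — the opcodes and names are invented for illustration, and the quadratic match search is far simpler than what real delta encoders do:

```python
# Toy byte-level delta, loosely in the spirit of Git's pack deltas:
# long runs already present in the "earlier" object become
# ("copy", offset, length) instructions; everything else is a literal.

def make_delta(earlier: bytes, later: bytes, min_match: int = 8):
    ops, i = [], 0
    while i < len(later):
        # Naive search for the longest match of later[i:] inside `earlier`.
        best_off, best_len = -1, 0
        for off in range(len(earlier)):
            n = 0
            while (off + n < len(earlier) and i + n < len(later)
                   and earlier[off + n] == later[i + n]):
                n += 1
            if n > best_len:
                best_off, best_len = off, n
        if best_len >= min_match:
            ops.append(("copy", best_off, best_len))
            i += best_len
        else:
            ops.append(("insert", later[i:i + 1]))
            i += 1
    return ops

def apply_delta(earlier: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, off, n = op
            out += earlier[off:off + n]
        else:
            out += op[1]
    return bytes(out)

# The "hair style" scenario in miniature: a small edit to a large file.
earlier = b"vertex data " * 50
later = earlier[:200] + b"NEW HAIR " + earlier[200:]

ops = make_delta(earlier, later)
assert apply_delta(earlier, ops) == later
# Nearly all of `later` is covered by a couple of copy instructions,
# so the delta is far smaller than the file itself.
```

This is why the delta step works on binary data just as well as on text — it never looks at lines, only at repeated byte runs.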

If you try this on large binary files, it turns out that it generally works pretty well on pairs of related, large, uncompressed binary files that have some kind of data locality, since the "later" binary file tends to contain many long repeats of the "earlier" file — and quite badly on large compressed binary files, or on binary files representing data structures that get shuffled about too much (so that the repeated binary strings become very fragmented, i.e., none of them are long any more). So once again it's quite data-dependent: try it on your specific data to see if it works well for you.
