Versioning large text files in git


Question

I've used git for a while for source control and I really like it. So I started investigating using git to store lots of large binary files, which I'm finding just isn't git's cup of tea. So how about large text files? It seems like git should handle those just fine, but I'm having problems with that too.

I'm testing this out using a 550 MB mbox-style text file. I ran git init to create a new repo for this. Here are my results:

  • git add and git commit - total repo size is 306 MB - repo contains one object that is 306 MB in size
  • add one email to the mailbox file and git commit - total repo size is 611 MB - repo contains two objects that are each 306 MB in size
  • add one more email to the mailbox file and git commit - total repo size is 917 MB - repo contains three objects that are each 306 MB in size

So every commit adds a new copy of the mailbox file to the repo. Now I want to try to get the size of the repo down to something manageable. Here are my results:

  • git repack -adf - total repo size is 877 MB - repo contains one pack file that is 876 MB in size
  • git gc --aggressive - total repo size is 877 MB - repo contains one pack file that is 876 MB in size

I would expect to be able to get the repo down to something around 306 MB, but I can't figure out how. Anything larger suggests that a lot of duplicate data is being stored.

My hope is that the repo would only increase by the size of the new email received, not by the size of the entire mailbox. I'm not trying to version control email here, but this seems to be the main thing holding me back from using a nightly script to incrementally back up users' home directories.

Any advice on how to keep the repo size from blowing up when appending a small amount of text to the end of a very large text file?

I've looked at bup and git annex, but I'd really like to stick with just plain old git if possible.

Thank you for your help!

Solution

I don't think git will do a good job at storing deltas in general, and even if you can finagle it to do so, it won't be deterministic. That said, based on http://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/, you may want to try git repack -a -d --depth=250 --window=250.
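As a rough sketch of that suggestion, the script below builds a small throwaway repo, commits a growing stand-in mailbox three times, and then runs the repack. The file name `mbox`, the identity settings, and the tiny file size are assumptions for illustration only. It may also be worth checking `core.bigFileThreshold`: according to the git-config documentation it defaults to 512 MiB, and files above it are stored deflated without attempting delta compression, which could be relevant for a 550mb mailbox.

```shell
#!/bin/sh
# Sketch only: a throwaway repo with a small stand-in for the mailbox file.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "backup"                   # hypothetical identity for the demo
git config user.email "backup@example.invalid"

# Build a small text file standing in for the mbox (the real one is far larger).
i=0
while [ "$i" -lt 2000 ]; do
    echo "message $i: some mail body text"
    i=$((i + 1))
done > mbox

# Three commits, each appending one "email" to the end of the file.
for n in 1 2 3; do
    echo "new email $n" >> mbox
    git add mbox
    git commit -q -m "backup $n"
done

# Show the threshold above which git skips delta compression (if configured).
git config core.bigFileThreshold || echo "core.bigFileThreshold is not set; git's default applies"

# The repack suggested above, with a deeper delta chain and a wider search window.
git repack -q -a -d --depth=250 --window=250

# Report how many packs exist and how much space they take.
git count-objects -v
packs=$(git count-objects -v | sed -n 's/^packs: //p')
echo "packs: $packs"
```

On a real repo only the `git repack` line is needed. If the threshold turns out to be the limiting factor, raising it (for example `git config core.bigFileThreshold 1g`) before repacking is something to experiment with, at the cost of more memory during packing.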

I suspect your best option is to truncate your history using git rebase, and only store the past few backups. You could do this using git branches. Make branches called yearly, monthly, and daily. Every day, commit to daily, then use git rebase --onto HEAD~4 HEAD~3 daily to delete backups older than 3 days. On the first day of every week, check out weekly and git cherry-pick daily, then do the same git rebase to remove weekly backups older than 3 weeks. Finally, on the first day of every year, follow a similar process. You will probably want to do a git gc after this sequence each time, to free up the old space.
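The daily step can be sketched roughly as follows, in a throwaway repo with five simulated nightly backups. The file name `mbox` and the identity settings are hypothetical. Because each backup appends to the same file, replaying the later commits can hit merge conflicts, so the sketch passes `-X theirs` to resolve conflicting hunks in favor of the commit being replayed. That option is an assumption added for this sketch and is not part of the command above.

```shell
#!/bin/sh
# Sketch only: shows `git rebase --onto` dropping one old backup commit.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/daily          # name the branch "daily" regardless of git defaults
git config user.name "backup"                   # hypothetical identity for the demo
git config user.email "backup@example.invalid"

# Simulate five nightly backups by appending one line per night.
for day in 1 2 3 4 5; do
    echo "mail from day $day" >> mbox
    git add mbox
    git commit -q -m "backup day $day"
done

# Replay everything after HEAD~3 on top of HEAD~4, which drops the commit at HEAD~3 (day 2).
# -X theirs (an assumption of this sketch) prefers the replayed commit on conflicting hunks.
git rebase -q -X theirs --onto HEAD~4 HEAD~3 daily

# Run git gc as suggested; objects still referenced by the reflog are kept
# until those reflog entries expire.
git gc -q

commits_after=$(git rev-list --count daily)
echo "commits on daily after rebase: $commits_after"
git log --format=%s daily
```

The weekly and yearly steps would follow the same pattern on their own branches, with git cherry-pick daily run before the rebase.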

But if you're doing this, you're no longer taking advantage of git and are abusing the way it works a fair amount. I think the best backup solution for you does not involve git.
