Is it possible to slim a .git repository without rewriting history?


Question




We have a number of git repositories which have grown to an unmanageable size due to the historical inclusion of binary test files and java .jar files.

We are just about to go through the exercise of running git filter-branch on these repositories and re-cloning them everywhere they are used (from dozens to hundreds of deployments each, depending on the repo). Given the problems with rewriting history, I was wondering whether there might be any other solutions.

Ideally I would like to externalise problem files without rewriting the history of each repository. In theory this should be possible because you are checking out the same files, with the same sizes and the same hashes, just sourcing them from a different place (a remote rather than the local object store). Alas none of the potential solutions I have found so far appear to allow me to do this.

Starting with git-annex, the closest I could find to a solution to my problem was How to retroactively annex a file already in a git repo, but as with just removing the large files, this requires the history to be re-written to convert the original git add into a git annex add.

Moving on from there, I started looking at other projects listed on what git-annex is not, so I examined git-bigfiles, git-media and git-fat. Unfortunately we can't use the git-bigfiles fork of git since we are an Eclipse shop and use a mixture of git and EGit. It doesn't look like git-media or git-fat can do what I want either, since while you could replace existing large files with the external equivalents, you would still need to rewrite the history in order to remove large files which had already been committed.

So, is it possible to slim a .git repository without rewriting history, or should we go back to the plan of using git filter-branch and a whole load of redeployments?


As an aside, I believe that this should be possible, but it is probably subject to the same limitations as git's current shallow clone implementation.

Git already supports multiple possible locations for the same blob, since any given blob could be in the loose object store (.git/objects) or in a pack file (.git/objects/pack), so theoretically you would just need something like git-annex hooked in at that level rather than higher up (i.e. the concept of a download-on-demand remote blob, if you like). Unfortunately I can't find anyone who has implemented, or even suggested, anything like this.
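The two on-disk homes for an object can be seen directly. The following throwaway demonstration (scratch repo, invented file name) shows the same objects stored loose after a commit and then moved into a pack by git gc:

```shell
#!/bin/sh
# Throwaway demo: the same objects live loose under .git/objects/xx/
# right after a commit, and inside a pack under .git/objects/pack/
# after a repack. Repo and file names are invented for illustration.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git config user.email you@example.com && git config user.name you
echo hello > f && git add f && git commit -qm "add f"

# Immediately after the commit, the blob, tree and commit are loose:
echo "loose objects: $(find .git/objects -type f -path '*/??/*' | grep -c .)"

# After garbage collection they all move into a single pack file:
git gc --quiet
echo "pack files: $(find .git/objects/pack -name '*.pack' | grep -c .)"
```

Both locations hold bit-identical objects with the same hashes, which is why a third, "remote on demand" location seems plausible in principle.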

Solution

Sort of. You can use Git's replace feature to set aside the big bloated history so that it is only downloaded if needed. It's like a shallow clone, but without a shallow clone's limitations.

The idea is you reboot a branch by creating a new root commit, then cherry-pick the old branch's tip commit. Normally you would lose all of the history this way (which also means you don't have to clone those big .jar files), but if the history is needed you can fetch the historical commits and use git replace to seamlessly stitch them back in.
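The steps above can be sketched end to end in a scratch repository. Everything here (file names, commit messages, the historical/master branch name) is invented for illustration; requires a git new enough to have git replace:

```shell
#!/bin/sh
# Runnable toy sketch of the reboot-and-replace idea. All names and
# commit messages are invented; adapt branch names to your own layout.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git symbolic-ref HEAD refs/heads/master        # pin the branch name
git config user.email you@example.com && git config user.name you

# A toy history whose early commits drag in a big binary.
echo blob > big.jar && git add big.jar && git commit -qm "add a jar"
git rm -q big.jar && git commit -qm "remove all of the big .jar files"
echo foo > foo && git add foo && git commit -qm "modify foo"
old_tip=$(git rev-parse master)

# 1. Park the complete history under a separate ref.
git branch historical/master master

# 2. Reboot: create a new parentless root commit ("instructions").
git checkout -q --orphan rebooted
git rm -q -rf .
echo "Full history lives in historical/master" > README
git add README && git commit -qm "instructions"

# 3. Re-create the old tip commit on top of the new root.
git cherry-pick "$old_tip"
new_tip=$(git rev-parse HEAD)
git branch -f master rebooted && git checkout -q master

# 4. Stitch the histories back together (a local, opt-in step).
git replace "$new_tip" "$old_tip"

echo "without replace: $(git --no-replace-objects log --oneline master | grep -c .) commits"
echo "with replace: $(git log --oneline master | grep -c .) commits"
```

On a shared server you would publish historical/master plus the replacement ref (e.g. `git push origin 'refs/replace/*'`), and users who want the full history opt in with `git fetch origin 'refs/replace/*:refs/replace/*'`.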

See Scott Chacon's excellent blog post for a detailed explanation and walk-through.

Advantages of this approach:

  • History is not modified. If you need to go back to an older commit, complete with its big .jars and everything, you still can.
  • If you don't need to look at the old history, the size of your local clone is nice and small, and any fresh clones you make won't require downloading tons of mostly-useless data.

Disadvantages of this approach:

  • The complete history is not available by default—users need to jump through some hoops to get at the history.
  • If you do need frequent access to the history, you'll end up downloading the bloated commits anyway.
  • This approach still has some of the same problems as rewriting history. For example, if your new repository looks like this:

    * modify bar (master)
    |
    * modify foo  <--replace-->  * modify foo (historical/master)
    |                            |
    * instructions               * remove all of the big .jar files
                                 |
                                 * add another jar
                                 |
                                 * modify a jar
                                 |
    

    and someone has an old branch off of the historical branch that they merge in:

    * merge feature xyz into master (master)
    |\__________________________
    |                           \
    * modify bar                 * add feature xyz
    |                            |
    * modify foo  <--replace-->  * modify foo (historical/master)
    |                            |
    * instructions               * remove all of the big .jar files
                                 |
                                 * add another jar
                                 |
                                 * modify a jar
                                 |
    

    then the big historical commits will reappear in your main repository and you're back to where you started. Note that this is no worse than rewriting history—someone might accidentally merge in the pre-rewrite commits.

    This can be mitigated by adding an update hook in your shared repository to reject any pushes that would reintroduce the historical root commit(s).
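Such an update hook might look something like the sketch below. It would live at hooks/update in the shared (bare) repository; the banned SHA-1 is a placeholder you would replace with the real root commit(s) of the parked history:

```shell
#!/bin/sh
# Sketch of hooks/update for the shared repository: reject any push
# whose new commits include the historical root commit. The SHA-1 in
# $banned is a placeholder -- substitute your actual historical root.
refname="$1"
oldrev="$2"
newrev="$3"

banned="0123456789abcdef0123456789abcdef01234567"   # placeholder
zero="0000000000000000000000000000000000000000"

# A ref deletion introduces no commits; let it through.
[ "$newrev" = "$zero" ] && exit 0

# List only the commits this push would add (those not already
# reachable from any existing ref) and refuse if the banned root
# is among them.
if git rev-list "$newrev" --not --all | grep -q "^$banned$"; then
    echo "*** push to $refname rejected: it would reintroduce the" >&2
    echo "*** historical commits that were split out of this repo" >&2
    exit 1
fi
exit 0
```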
