Removing big files from Git history


Question

I've read multiple answers advising the use of either filter-branch or BFG for this task, but I feel I need further advice because my situation is a bit peculiar.

I have to manage two repositories. One is basically a clone of the other, and ideally I'd like to pull the changes from the origin into the clone on a daily basis. However, the origin repo contains very big files in its history, which are above GitHub's size limits. So I have to remove these files, but at the same time I don't want to harm the existing commit history beyond the changes to those specific files. From what I understand, BFG performs a complete rewrite of the history, which will fool GitHub into thinking that all existing files were deleted and recreated as new files, whereas filter-branch doesn't do that, but it's also extremely slow by comparison, and my repository is very big, at about 100000 commits...

So I'm trying to figure out the best way to go about this. Should I use BFG at certain points and simply accept that I'm going to see ridiculous pull requests as a result of its modifications, or should I use filter-branch in some manner? To clarify, only 3 files are the cause of this grievance.

Recommended answer

Commit history in Git is nothing but commits.

No commit can ever be changed. So for anything to remove a big file from some existing commit, that thing (whether it's BFG, git filter-branch, git filter-repo, or whatever) is going to have to extract a "bad" commit, make some changes (e.g., remove the big file), and make a new and improved substitute commit.

The terrible part of this is that each subsequent commit encodes, in an unchangeable way, the raw hash ID of the bad commit. The immediate children of the bad commit encode it as their parent hash. So you (or the tool) must copy those commits to new-and-improved ones. What's improved about them is that they lack the big file and refer back to the replacement just made for the initial bad commit.

Of course, their children encode their hash IDs as parent hash IDs, so now the tool must copy those commits. This repeats all the way up to the last commit in each branch, as identified by the branch name:

...--o--o--x--o--o--o   [old, bad version of branch]
         \
          ●--●--●--●   <-- branch

where x is the bad commit: x had to be copied to the first new-and-improved commit, and then every subsequent commit had to be copied too.

The copies, being different commits, have different hash IDs. Every clone must now abandon the "bad" commits (the x one and all its descendants) in favor of the new-and-improved ones.
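To make the point about parent hashes concrete, here is a minimal sketch that builds a throwaway repository and copies a commit onto a different parent. The file names, the identity settings, and the use of an empty tree for the replacement commit are all assumptions made for illustration; none of them come from the question.

```shell
#!/bin/sh
# Sketch only: a scratch repository with illustrative names.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo one > a.txt && git add a.txt && git commit -q -m "first"
echo two > b.txt && git add b.txt && git commit -q -m "second"

first=$(git rev-parse HEAD~1)
child=$(git rev-parse HEAD)
tree=$(git rev-parse "HEAD^{tree}")

# The child's raw content names its parent by hash ID.
parent_in_child=$(git cat-file -p "$child" | sed -n 's/^parent //p')
echo "parent recorded in child: $parent_in_child"
echo "hash of first commit:     $first"

# Make a stand-in replacement for the first commit (here: an empty snapshot),
# then copy the child onto it. Same snapshot, same message, different parent.
empty_tree=$(git mktree </dev/null)
new_first=$(git commit-tree -m "first" "$empty_tree")
copy=$(git commit-tree -p "$new_first" -m "second" "$tree")

echo "original child: $child"
echo "copied child:   $copy"
```

The copied child keeps exactly the same snapshot as the original, yet its hash ID differs because the parent it records differs. That is why every descendant of a rewritten commit has to be copied as well.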

All these repository-editing tools should strive to make minimal changes. The BFG is probably the fastest and most convenient to use, but git filter-branch can be told to copy only the bad commit and its descendants, and to use --index-filter, which is its fastest (still slow!) filter. To do this, use:

git filter-branch --index-filter <command> -- <hash>..branch1 <hash>..branch2 ...

where <command> is an appropriate "git rm --cached --ignore-unmatch" command (be sure to quote the whole thing), and the <hash> and branch names specify which commits to copy. Remember that the A..B syntax means "don't look at commit A or anything earlier, but do look at commit B and earlier". So if commit x is, say, deadbeefbadf00d..., you'll want to use the hash of its parent as the limiter:

git filter-branch --index-filter "..." -- deadbeefbadf00d^..master

for instance (fill in the ... part with the right removal command).
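The reason for using the parent as the limiter can be checked with git rev-list, which accepts the same range syntax. The sketch below assumes a scratch repository with four linear commits and treats "commit 2" as the bad one; all names are illustrative.

```shell
#!/bin/sh
# Sketch only: four linear commits in a scratch repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

for n in 1 2 3 4; do
  echo "$n" > "file$n.txt"
  git add "file$n.txt"
  git commit -q -m "commit $n"
done

# Pretend "commit 2" is the bad commit x.
bad=$(git rev-parse HEAD~2)

# x^..HEAD selects x itself plus everything after it.
from_parent=$(git rev-list --count "$bad^..HEAD")
# x..HEAD leaves x out, so the bad commit would not be rewritten.
from_bad=$(git rev-list --count "$bad..HEAD")

echo "commits selected by x^..HEAD: $from_parent"
echo "commits selected by x..HEAD:  $from_bad"
```

Note that x^ means the first parent of x, so this form assumes x is not a root commit.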

(Note: I have not actually used The BFG, but if it re-copies commits unnecessarily, that's really bad, and I bet it does not.)
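Putting the filter-branch recipe together, the following is a rough end-to-end sketch on a scratch repository. The file name big.bin is hypothetical, as is the way the bad commit is located (git log --diff-filter=A); the FILTER_BRANCH_SQUELCH_WARNING variable only silences the startup warning that newer Git versions print. Try something like this on a fresh clone, not on the only copy of a real repository.

```shell
#!/bin/sh
# Sketch only: rewrite just the bad commit and its descendants.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "example@example.com"

echo keep > keep.txt && git add keep.txt && git commit -q -m "good commit"
good=$(git rev-parse HEAD)

# big.bin stands in for one of the oversized files (hypothetical name).
echo "pretend this is huge" > big.bin
git add big.bin && git commit -q -m "bad commit: adds big.bin"

echo more > more.txt && git add more.txt && git commit -q -m "later commit"
branch=$(git symbolic-ref --short HEAD)

# Find the earliest commit that added the file; this is commit x.
bad=$(git log --diff-filter=A --format=%H -- big.bin | tail -n 1)

# Copy only x and its descendants, dropping big.bin from each snapshot.
git filter-branch \
  --index-filter 'git rm --cached --ignore-unmatch -q big.bin' \
  -- "$bad^..$branch" >/dev/null

kept_root=$(git rev-parse "$branch~2")
new_bad=$(git rev-parse "$branch~1")
remaining=$(git rev-list "$branch" -- big.bin)

echo "commit before x, before and after: $good $kept_root"
echo "x before and after:                $bad $new_bad"
```

The commit before x keeps its hash ID, while x and everything after it get new ones. filter-branch keeps the old branch tip under refs/original/, so the big file is still stored in the repository until those refs are removed and the objects are garbage-collected.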
