Removing big files from Git history


Problem description

I've read multiple answers that recommend either filter-branch or BFG for this task, but I feel I need further advice because my situation is a bit peculiar.

I have to manage two repositories, one of which is basically a clone of the other, and ideally I'd want to pull the changes from the origin into the clone on a daily basis. However, the origin repo contains very big files in its history, which are above GitHub's size limits. So I have to remove these files, but at the same time I don't want to harm the existing commit history beyond the changes to those specific files. From what I understand, BFG performs a complete rewrite of the history, which will fool GitHub into thinking that all existing files were deleted and recreated as new files, whereas filter-branch doesn't do that, but it's also extremely slow by comparison, and my repository is very big, at about 100000 commits.

So I'm trying to figure out the best way to go about this. Should I use BFG at certain points and simply accept that I'm going to see ridiculous pull requests as a result of its modifications, or should I use filter-branch in some manner? To clarify, only 3 files are the cause of this grievance.

Recommended answer

Commit history in Git is nothing but commits.

No commit can ever be changed. So for anything to remove a big file from some existing commit, that thing (whether it's BFG, git filter-branch, git filter-repo, or whatever) is going to have to extract a "bad" commit, make some changes (e.g., remove the big file), and make a new and improved substitute commit.

The terrible part of this is that each subsequent commit encodes, in an unchangeable way, the raw hash ID of the bad commit. The immediate children of the bad commit encode it as their parent hash. So you, or the tool, must copy those commits to new-and-improved ones. What's improved about them is that they lack the big file and refer back to the replacement just made for the initial bad commit.

Of course, their children encode their hash IDs as parent hash IDs, so now the tool must copy those commits too. This repeats all the way up to the last commit in each branch, as identified by the branch name:

...--o--o--x--o--o--o   [old, bad version of branch]
         \
          ●--●--●--●   <-- branch

where x is the bad commit: x had to be copied to the first new-and-improved ●, but then all subsequent commits had to be copied too.

The copies, being different commits, have different hash IDs. Every clone must now abandon the "bad" commits (the x one and all its descendants) in favor of the new-and-improved ones.
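
The effect of the parent hash on a commit's own hash ID can be checked directly. The following is a minimal sketch, assuming a POSIX shell with git installed; the throwaway repository, the identity "Example", and the fixed timestamp are placeholders chosen only for this illustration. It builds two commits that are identical in every field except their parent and shows that their hash IDs differ.

```shell
#!/bin/sh
set -eu

# Throwaway repository; nothing here touches an existing repo.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Fix identity and dates so that the parent is the only field that differs.
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.com"
export GIT_AUTHOR_DATE="946684800 +0000" GIT_COMMITTER_DATE="946684800 +0000"

# An empty tree is enough: the file contents are not the point here.
tree=$(git mktree </dev/null)

# Two different root commits stand in for the "bad" commit and its replacement.
root_a=$(git commit-tree -m "root A" "$tree")
root_b=$(git commit-tree -m "root B" "$tree")

# Same tree, same message, same author and dates; only the parent differs.
child_a=$(git commit-tree -p "$root_a" -m "child" "$tree")
child_b=$(git commit-tree -p "$root_b" -m "child" "$tree")

echo "child of A: $child_a"
echo "child of B: $child_b"

# The parent hash is stored inside the commit object itself.
git cat-file -p "$child_a"
```

The two child hashes differ even though the snapshot and metadata are the same, which is why replacing one commit forces a copy of every descendant.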

All these repository-editing tools should strive to make minimal changes. The BFG is probably the fastest and most convenient to use, but git filter-branch can be told to copy only the bad commits and their descendants, and to use --index-filter, which is its fastest (still slow!) filter. To do this, use:

git filter-branch --index-filter <command> -- <hash>..branch1 <hash>..branch2 ...

where <command> is an appropriate "git rm --cached --ignore-unmatch" command (be sure to quote the whole thing), and the <hash> and branch names specify which commits to copy. Remember that the A..B syntax means "don't look at commit A or earlier, but do look at commit B and earlier", so if commit x is, say, deadbeefbadf00d..., you'll want to use the hash of its parent as the limiter:

git filter-branch --index-filter "..." -- deadbeefbadf00d^..master

for instance (fill in the ... part with the right removal command).
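
As a concrete illustration, here is a sketch that runs this kind of command end to end in a throwaway repository. It assumes a POSIX shell with git installed; the file name big.bin, the commit messages, and the identity are hypothetical stand-ins, not values from the question. It first counts what the range selects, then rewrites only those commits.

```shell
#!/bin/sh
set -eu

# Newer git versions print a warning and pause before filter-branch runs;
# this variable skips that.
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway repository with three commits: good, bad (adds big.bin), later.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo a > a.txt
git add a.txt
git commit -q -m "good commit"
good=$(git rev-parse HEAD)

echo big > big.bin
git add big.bin
git commit -q -m "bad commit: adds big.bin"
bad=$(git rev-parse HEAD)

echo b > b.txt
git add b.txt
git commit -q -m "later commit"
branch=$(git symbolic-ref --short HEAD)

# bad^..branch excludes the parent of the bad commit and everything before it,
# so only the bad commit and its descendant are selected for copying.
selected=$(git rev-list --count "$bad^..$branch")
echo "commits selected for copying: $selected"

# Remove big.bin from the index of each selected commit and re-commit.
git filter-branch --index-filter \
    "git rm --cached --ignore-unmatch -q big.bin" -- "$bad^..$branch"

echo "files at the rewritten tip:"
git ls-tree -r --name-only "$branch"
echo "oldest commit before: $good"
echo "oldest commit after:  $(git rev-parse "$branch~2")"
```

The commit before the bad one keeps its original hash ID, while the bad commit and everything after it are replaced by copies without big.bin. Note that filter-branch keeps the old commits reachable under refs/original/ until those backup refs are deleted.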

(Note: I have not actually used The BFG, but if it re-copies commits unnecessarily, that's really bad, and I bet it does not.)
