After deleting a binary file from Git history, why is my repository still large?


Question

Let me preface this question by saying that I am aware of the previous questions on this subject on Stack Overflow. In fact, I've tried all the solutions I could find, but there is a binary file in my repo that refuses to be removed and continues to greatly inflate my repo size.

What I have tried: both of the methods recommended in Darhuuk's answer to the question about completely removing a file from a git repo.

After trying both of those solutions, the script to find large files in git still finds the offending binary, whereas the script from this answer no longer finds a commit for the binary. Both of these scripts were suggested by this answer.

The repo is still 44 MB after the removal attempts, which is far too large for the relatively small size of the source. That suggests the large-file script is doing its job properly. I've tried pushing to GitHub (I made a fork just in case) and then doing a fresh clone to see whether the repo size had decreased, but it is still the same size.

Can someone explain what I am doing wrong or suggest an alternative method?

I should note that I am not just interested in trimming the file from my local repo; I also want to be able to fix the remote repo on GitHub.

Recommended answer

2017: If you are reading this, you should probably look into BFG Repo-Cleaner.

Embarrassingly, the reason my local repos were not shrinking is that I was using the wrong path to the file in filter-branch. So while I thank J-16 SDiZ and CodeGnome for their answers, my problem was between the chair and the keyboard.

In an effort to make this question less of a monument to my stupidity and actually useful to people, I've taken the time to write up the steps one has to go through after trimming the repo in order to get it back up on GitHub. Hope this helps someone out down the line.

To remove the offending files, run the shell script below, which is based on the GitHub "remove sensitive data" how-to.

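#!/usr/bin/env bash
# Pass the path to remove as the first argument, relative to the repository root.
# Rewrite every commit on every ref, dropping that path from each commit's index.
git filter-branch --index-filter "git rm -r -q --cached --ignore-unmatch -- '$1'" --prune-empty --tag-name-filter cat -- --all

# Drop the backup refs and reflog entries that still point at the old commits,
# then repack so the now-unreferenced blobs are actually deleted.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now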

I went through every branch on my local repository and did this, but that should not be necessary: the trailing -- --all tells filter-branch to process all refs in one run, so you don't need to repeat it on every branch. You do, however, need every branch available locally for the next step, so keep that in mind. Once you are done, you should see the size of your local repo decrease. You should also be able to run the blob script from CodeGnome's answer and see that the offending blob has been removed. If not, double-check the file name and path and make sure they are correct.
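As a rough sketch of that double check (not the script from CodeGnome's answer), the two helpers below list the largest blobs still reachable from any ref and show which commits touched a given path. The function names largest_blobs and commits_touching are made up for illustration. Had I run something like the second one before filtering, the empty output would have told me my path was wrong.

```shell
#!/usr/bin/env bash
# Sketch only: largest_blobs and commits_touching are hypothetical helper names.

# Print the N largest blobs reachable from any ref as "size sha path",
# biggest first (N defaults to 10).
largest_blobs() {
  local count="${1:-10}"
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -rn |
    head -n "$count"
}

# Print every commit, on any ref, that touched the given path.
# No output before filtering means the path matches nothing in history.
commits_touching() {
  git log --all --oneline -- "$1"
}
```

Run both from the repository root, before and after the filter-branch script: the file should show up in both before, and in neither afterwards.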

What git filter-branch is actually doing here is running the quoted command against every commit in the repo.

The rest of the script removes the leftover references to the old data (the refs/original backups and the reflog entries) and then garbage-collects the objects they were keeping alive.

Now that the local repo is in the state you need, the trick is to get it back up on GitHub. Unfortunately, as far as I can tell, there is no way to completely remove the binary data from the existing GitHub repo. Here is the quote from the GitHub sensitive data how-to:

Be warned that force-pushing does not erase commits on the remote repo, it simply introduces new ones and moves the branch pointer to point to them. If you are worried about users accessing the bad commits directly via SHA1, you will have to delete the repo and recreate it.

It sucks that you need to recreate the GitHub repo, but the good news is that recreating the repo is actually pretty easy. The pain is that you also have to recreate the data in issues and the wiki, which I'll go into below.

What I recommend is creating a new repo on GitHub and then swapping it with your old repo when you are ready. This can be done by renaming the old one to something like "repo name old" and then changing the name of the newly created repo to "repo name". When you create the new repo, make sure to uncheck "initialize with README", otherwise you're not going to be dealing with a clean slate.

If you completed the last step, you should have your repo cleaned and ready to go. The remotes now need to be changed to match the new GitHub repo location. I do this by editing the .git/config file directly, though I am sure someone is going to tell me that is not the right way to do it.
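For anyone who would rather not hand-edit .git/config, git remote set-url makes the same change from the command line. A minimal sketch, assuming the remote is named origin; the helper name repoint_origin and the URL in the comment are placeholders of mine:

```shell
#!/usr/bin/env bash
# Sketch only: repoint_origin is a hypothetical helper name.

# Point the existing "origin" remote at a new URL and print what it is now set to.
repoint_origin() {
  git remote set-url origin "$1" && git remote get-url origin
}

# Example (placeholder URL, substitute the address of your new repo):
# repoint_origin git@github.com:USER/NEW-REPO.git
```

The printed URL is a quick confirmation that the next push will go to the new repository rather than the old one.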

Before doing the push, make sure you have all the branches and tags you want to push in your local repo. Once you are ready, push all branches and tags using the following:

git push --all
git push --tags

Now you should have a remote repo that matches your trimmed local repo. Double-check that all the data made it across, just in case.
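One way to do that check, sketched under the assumption that the remote is called origin: compare the commit each local branch and tag points at with what the remote reports. The function name refs_match_remote is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: refs_match_remote is a hypothetical helper name.

# Succeed (exit 0) when the local branches and tags are exactly the ones the
# remote has, each pointing at the same object.
refs_match_remote() {
  local remote="${1:-origin}" local_refs remote_refs
  local_refs="$(git show-ref --heads --tags | sort)"
  # ls-remote separates fields with a tab and also lists peeled "^{}" entries
  # for annotated tags; normalise both so the two listings are comparable.
  remote_refs="$(git ls-remote --heads --tags "$remote" | grep -v '\^{}$' | tr '\t' ' ' | sort)"
  [ "$local_refs" = "$remote_refs" ]
}

# Example:
# refs_match_remote origin && echo "remote matches local"
```

This only compares ref names and object IDs, so it says nothing about issues or the wiki, which are covered next.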

Now, if you don't have to worry about issues or the wiki, you are done. If you do, read on.

The GitHub wiki is just another repo associated with your main repo. So, to get started, clone your old wiki repo somewhere. The next part is kind of tricky: as far as I can tell, you need to click on the wiki tab of your new repo in order to create the wiki, but doing so seeds the newly created wiki with an initial file. So what I did, and I am not sure whether there is a better way, was to change the remote to the newly created wiki repo and push to the new location using

git push --all --force

The force is needed here because otherwise git will complain that the tip of the current branch does not match. I think this may leave the initial page in a detached state in the git repo, but the effect of that on the size of the repo should be negligible.
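Put together, the wiki move could be sketched as below. This is my reading of the steps above rather than a tested recipe against GitHub itself; migrate_wiki is a made-up name, and the URLs in the comment are placeholders (GitHub wiki clone URLs end in .wiki.git).

```shell
#!/usr/bin/env bash
# Sketch only: migrate_wiki is a hypothetical helper name.

# Clone the old wiki into a scratch directory, repoint it at the new wiki,
# and force-push so the old history replaces the seeded initial page.
migrate_wiki() {
  local old_url="$1" new_url="$2" workdir
  workdir="$(mktemp -d)" || return 1
  git clone -q "$old_url" "$workdir/wiki" || return 1
  git -C "$workdir/wiki" remote set-url origin "$new_url" || return 1
  git -C "$workdir/wiki" push -q --all --force origin
}

# Example (placeholder URLs):
# migrate_wiki git@github.com:USER/OLD-REPO.wiki.git git@github.com:USER/NEW-REPO.wiki.git
```

A fresh clone only has the default branch checked out locally, so push --all sends just that branch, which I assume is all a wiki normally has.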

As for issues, there is advice on moving them in this answer. But the script linked in that answer looks fairly incomplete: there is a TODO for comment importing, and I couldn't tell whether it would bring over the state of issues or not.

So, given that I had a fairly small queue of open issues and didn't mind losing the closed ones, I elected to bring things over by hand. Note that it is impossible to do this with proper attribution to other people on comments. For a larger, more established project, I think you would need to write a more robust script to bring everything over, but that wasn't needed in my particular case.
