How to filter history based on gitignore?


Question


To be clear on this question, I am not asking about how to remove a single file from history, like this question: Completely remove file from all Git repository commit history. I am also not asking about untracking files from gitignore, like in this question: Ignore files that have already been committed to a Git repository.

I am talking about "updating a .gitignore file, and subsequently removing everything matching the list from history", more or less like this question: Ignore files that have already been committed to a Git repository. However, unfortunately, the answer from that question does not work for this purpose, so I am here to try elaborating the question and hopefully find a good answer that does not involve a human looking through an entire source tree to manually do a filter-branch on each matched file.

Here I provide a test script, currently performing the procedure in the answer of Ignore files that have already been committed to a Git repository. It is going to remove and create a folder root under PWD, so be careful before running it. I will describe my goal after the code.

#!/bin/bash -e

TESTROOT=${PWD}
GREEN="\e[32m"
RESET="\e[39m"

rm -rf root
mkdir -v root
pushd root

mkdir -v repo
pushd repo
git init

touch a b c x 
mkdir -v main
touch main/{a,x,y,z}

# Initial commit
git add .
git commit -m "Initial Commit"
echo -e "${GREEN}Contents of first commit${RESET}"
git ls-files | tee ../00-Initial.txt

# Add another commit just for demo
touch d e f y z main/{b,c}
## Make some other changes
echo "Test" | tee a | tee b | tee c | tee x | tee main/a > main/x
git add .
git commit -m "Some edits"

echo -e "${GREEN}Contents of second commit${RESET}"
git ls-files | tee ../01-Changed.txt

# Now I want to ignore all 'a' and 'b', and all 'main/x', but not 'main/b'
## Checkout the root commit
git checkout -b temp $(git rev-list HEAD | tail -1)
## Add .gitignores
echo "a" >> .gitignore
echo "b" >> .gitignore
echo "x" >> main/.gitignore
echo "!b" >> main/.gitignore
git add .
git commit --amend -m "Initial Commit (2)"
## --v Not sure if it is correct
git rebase --onto temp master
git checkout master
## --v Now, why should I delete this branch?
git branch -D temp
echo -e "${GREEN}Contents after rebase${RESET}"
git ls-files | tee ../02-Rebased.txt

# Supposedly, rewrite history
git filter-branch --tree-filter 'git clean -f -X' -- --all
echo -e "${GREEN}Contents after filter-branch${RESET}"
git ls-files | tee ../03-Rewritten.txt

echo "History of 'a'"
git log -p a

popd # repo

popd # root

This code creates a repository, adds some files, makes some edits, and performs the cleaning procedure. It also generates some log files. Ideally, I would like a, b, and main/x to disappear from history, while main/b stays. However, right now nothing is removed from history. What should be modified to achieve this goal?

Bonus points if this can be done on multiple branches. But for now, keep it to a single master branch.

Solution

Achieving the result you want is a bit tricky. The simplest way, using git filter-branch with a --tree-filter, will be very slow. Edit: I've modified your example script to do this; see the end of this answer.

First, let's note one constraint: you can never change any existing commit. All you can do is make new commits that look a lot like the old ones, but "new and improved". You then direct Git to stop looking at the old commits, and look only at the new ones. This is what we will do here. (Then, if required, you can force Git to really forget the old commits. The easiest way is to re-clone the clone.)

Now, to re-commit every commit that is reachable from one or more branch and/or tag names, preserving everything except that which we explicitly tell it to change,¹ we can use git filter-branch. The filter-branch command has a rather dizzying array of filtering options, most of which are meant to make it go faster, because copying every commit is pretty slow. If there are just a few hundred commits in a repository, with a few dozen or a few hundred files each, it's not so bad; but if there are about 100k commits holding about 100k files each, that's ten thousand million files (10,000,000,000 files) to examine and re-commit. It is going to take a while.

Unfortunately, there is no easy and convenient way to speed this up. The best way to speed it up would be to use an --index-filter, but there is no built-in index filter command that will do what you want. The easiest filter to use is --tree-filter, which is also the slowest one there is. You might want to experiment with writing your own index filter, perhaps in shell script or perhaps in another language you prefer (you will still need to invoke git update-index either way).


¹ Signed annotated tags cannot be preserved intact, so their signatures will be stripped. Signed commits may have their signatures become invalid if the commit hash changes. Whether the hash changes depends on whether it must: the hash ID of a commit is the checksum of the commit's contents, so if the set of files changes, the checksum changes; and if the checksum of a parent commit changes, the checksum of this commit changes too.


Using --tree-filter

When you use git filter-branch with --tree-filter, what the filter-branch code does is to extract each commit, one at a time, into a temporary directory. This temporary directory has no .git directory and is not where you are running git filter-branch (it's actually in a subdirectory of the .git directory unless you use the -d option to redirect Git to, say, a memory filesystem, which is a good idea for speeding it up).

After extracting the entire commit into this temporary directory, Git runs your tree-filter. Once your tree-filter finishes, Git packages up everything in that temporary directory into the new commit. Whatever you leave there, is in. Whatever you add to there, is added. Whatever you modify there, is modified. Whatever you remove from there, is no longer in the new commit.

Note that a .gitignore file in this temporary directory has no effect on what will be committed (but the .gitignore file itself will be committed, since whatever is in the temporary directory becomes the new copy-commit). So if you want to be sure that a file of some known path is not committed, simply rm -f known/path/to/file.ext. If the file was in the temporary directory, it is now gone. If not, nothing happens and all is well.

Hence, a workable tree filter would be:

rm -f $(cat /tmp/files-to-remove)

(assuming no white space issues in file names; to avoid white space issues, feed the list to xargs instead, as in xargs ... rm -f < /tmp/files-to-remove, with whatever encoding you like for the xargs input; NUL-terminated encoding, as read by xargs -0, is ideal since \0 is forbidden in path names).
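The NUL-terminated variant can be tried outside of Git first. The following is a minimal sketch, not part of the original answer: the scratch directory, the list file, and all file names are made up for illustration, and it assumes an xargs that supports -0 (GNU and BSD xargs both do).

```shell
#!/bin/bash
set -e

# Hypothetical scratch directory standing in for filter-branch's temporary tree.
scratch=$(mktemp -d)
# Hypothetical removal list, kept outside the scratch directory.
list=$(mktemp)

cd "$scratch"
touch "plain.txt" "name with spaces.txt" "keep me.txt"

# Each path is terminated by a NUL byte, so spaces in names are harmless.
# "not-present.txt" does not exist; rm -f ignores it silently.
printf '%s\0' "plain.txt" "name with spaces.txt" "not-present.txt" > "$list"

# xargs -0 splits its input on NUL bytes only and passes the paths to rm -f.
xargs -0 rm -f < "$list"

# Record what is left in the scratch directory.
remaining=$(ls)
echo "$remaining"
```

Used as a tree filter, only the xargs line would be needed, with the list file at a fixed absolute path such as /tmp/files-to-remove.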

Converting this to an index filter

Using an index filter lets Git skip the extract-and-examine phases. If you had a fixed "remove" list in the right form, it would be easy to use.

Let's say you have the file names in /tmp/files-to-remove in a form that is suitable for xargs -0. Your index filter might then read, in its entirety:

xargs -0 git rm --cached -f --ignore-unmatch < /tmp/files-to-remove

which is basically the same as the rm -f above, but works within the temporary index Git uses for each commit-to-be-copied. (Add -q to the git rm --cached to make it quiet.)
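If the removal list should come from ignore patterns rather than a fixed list of paths, one possible direction is to let git ls-files compute the list from the index and hand it to git update-index. The sketch below is an assumption-laden illustration, not the answer's method: it uses a single hypothetical pattern file with patterns written relative to the repository root (per-directory .gitignore files are not consulted), and it runs against a throwaway repository's index rather than inside filter-branch.

```shell
#!/bin/bash
set -e

# Throwaway repository and a hypothetical pattern file (gitignore syntax,
# patterns relative to the repository root; later patterns win).
repo=$(mktemp -d)
patterns=$(mktemp)
printf '%s\n' 'a' 'b' '!/main/b' '/main/x' > "$patterns"

cd "$repo"
git init -q .
mkdir main
touch a b c x main/a main/b main/x main/y
git add .

# List index entries that match the patterns (NUL-terminated), then
# remove exactly those entries from the index. Work-tree files are untouched.
git ls-files -c -i -z --exclude-from="$patterns" |
    git update-index --force-remove -z --stdin

# Record what is still tracked.
tracked=$(git ls-files)
echo "$tracked"
```

The two-command pipeline is what would go into --index-filter, with the pattern file given as a fixed absolute path. Reading the list from the index is what lets this avoid checking out each commit.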

Applying .gitignore files in a tree filter

Your example script tries to use a --tree-filter after rebasing onto an initial commit that has the desired items:

git filter-branch --tree-filter 'git clean -f -X' -- --all

There is one initial bug though (the git rebase is wrong):

-git rebase --onto temp master
+git rebase --onto temp temp master

Fixing that, the thing still doesn't work, and the reason is that git clean -f -X only removes files that are actually ignored. Any file that is already in the index, is not actually ignored.
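This behaviour is easy to confirm in a throwaway repository. The sketch below is only an illustration under assumed names (a temporary repository with one file called a): it asks git clean for a dry run before and after the file is dropped from the index.

```shell
#!/bin/bash
set -e

# Throwaway repository; nothing here touches a real project.
repo=$(mktemp -d)
cd "$repo"
git init -q .

echo a > .gitignore
touch a
# -f forces Git to track 'a' even though .gitignore lists it.
git add -f a .gitignore

# Dry run: 'a' is tracked, so it does not count as ignored.
before=$(git clean -n -X)

# Drop 'a' from the index only; the file stays in the work tree.
git rm --cached -qf a

# Dry run again: 'a' is now untracked and ignored.
after=$(git clean -n -X)

echo "before: [$before]"
echo "after:  [$after]"
```

The first dry run reports nothing, while the second reports that a would be removed.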

The trick is to empty out the index. However, this does too much: git clean then never descends into subdirectories—so the trick comes in two parts: empty out the index, then re-fill it with non-ignored files. Now git clean -f -X will remove the remaining files:

-git filter-branch --tree-filter 'git clean -f -X' -- --all
+git filter-branch --tree-filter 'git rm --cached -qrf . && git add . && git clean -fqX' -- --all

(I added several "quiet" flags here).

To avoid needing to rebase in the first place to install initial .gitignore files, let's say you have a master set of .gitignore files you want in every commit (which we'll then use in the tree filter as well). Simply place these, and nothing else, in a temporary tree:

mkdir /tmp/ignores-to-add
cp .gitignore /tmp/ignores-to-add
mkdir /tmp/ignores-to-add/main
cp main/.gitignore /tmp/ignores-to-add/main

(I'll leave working up a script that finds and copies just the .gitignore files to you; it seems moderately annoying to do without one.) Then, for the --tree-filter, use:

cp -R /tmp/ignores-to-add/. . &&
    git rm --cached -qrf . &&
    git add . &&
    git clean -fqX

The first step, cp -R (which can be done anywhere before the git add ., really), installs the correct .gitignore files. Since we do this to each commit, we never need to rebase before running filter-branch.

The second removes everything from the index. (A slightly faster method is just rm $GIT_INDEX_FILE but it's not guaranteed that this will work forever.)

The third re-adds ., i.e., everything in the temporary tree. Since the .gitignore files are in place, we only add non-ignored files.

The last step, git clean -qfX, removes work-tree files that are ignored, so that filter-branch won't put them back.
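To see the four steps working together, here is a self-contained sketch on a throwaway repository. It is an illustration only, not the modified script the answer refers to: the temporary paths, the demo identity, and the file set are assumptions modelled loosely on the question, and the ignore files are copied with a trailing /. so that their contents, rather than the directory itself, land in each commit's tree.

```shell
#!/bin/bash
set -e

# Demo identity and warning suppression, so the script runs unattended.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)

# Master set of .gitignore files (hypothetical location).
ignores="$work/ignores-to-add"
mkdir -p "$ignores/main"
printf 'a\nb\n' > "$ignores/.gitignore"
printf 'x\n!b\n' > "$ignores/main/.gitignore"

# Two-commit history containing files that should be ignored.
git init -q "$work/repo"
cd "$work/repo"
mkdir main
touch a b c x main/a main/b main/x main/y
git add .
git commit -q -m "Initial commit"
echo Test | tee a c main/x > /dev/null
git add .
git commit -q -m "Some edits"

# Install the ignore files, empty the index, re-add non-ignored files,
# then delete the ignored leftovers from the temporary tree.
git filter-branch --tree-filter "
    cp -R '$ignores/.' . &&
    git rm --cached -qrf . &&
    git add . &&
    git clean -fqX
" -- --all

# Files tracked at the rewritten tip, and commits that still touch ignored paths.
tracked=$(git ls-files)
history=$(git log --format=%H -- a b main/x)
echo "$tracked"
```

After the rewrite, a, b, main/a, and main/x are gone from the branch's history, while main/b survives because main/.gitignore re-includes it. The old commits remain reachable through refs/original/ until those refs are deleted or the repository is re-cloned.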
