Efficiently rewriting (rebase -i) a lot of history with git

Published: 2018/4/27 20:47:43. Tags: perl, git, git-rebase


Problem description


I have a git repository with about 3500 commits and 30,000 distinct files in the latest revision. It represents about 3 years of work from multiple people and we have received permission to make it all open-source. I am trying hard to release the entire history, instead of just the latest version. To do this I am interested in "going back in time" and inserting a license header at the top of files when they are created. I actually have this working, but it takes about 3 days running entirely out of a ramdisk, and still does require a little bit of manual intervention. I know it can be a lot faster, but my git-fu is not quite up to the task.

The question: how can I accomplish the same thing a lot faster?

What I currently do (automated in a script, but please bear with me...):

Identify all of the commits where a new file was added to the repository (there are just shy of 500 of these, fwiw):

    git whatchanged --diff-filter=A --format=oneline

Define environment variable GIT_EDITOR to be my own script that replaces pick with edit only a single time on the first line of the file (you will see why shortly). This is the core of the operation:

    perl -pi -e 's/pick/edit/ if $. == 1' $1
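As a rough sketch of that editor step (not the script actually used here), the same first-line substitution can be written as a small shell function. The function name is made up for illustration, and anchoring the match to a leading `pick ` is an extra precaution the one-liner does not take. Newer git versions also honor `GIT_SEQUENCE_EDITOR`, which is consulted only for the rebase todo list, so that may be a cleaner hook than `GIT_EDITOR`.

```shell
# Hypothetical helper: turn the first todo entry from "pick" into "edit".
# Git invokes the editor with the path of the todo file as its only argument.
mark_first_pick_as_edit() {
    todo_file=$1
    tmp_file="$todo_file.tmp"
    # Only line 1 is touched, and only when it starts with "pick ".
    sed '1s/^pick /edit /' "$todo_file" > "$tmp_file" &&
        mv "$tmp_file" "$todo_file"
}

# When this file is used directly as the editor script, the todo path
# arrives as "$1"; without arguments nothing happens.
if [ "$#" -gt 0 ]; then
    mark_first_pick_as_edit "$1"
fi
```

Saved as an executable script, this could be used as `GIT_SEQUENCE_EDITOR=/path/to/script git rebase -i <commit>~1`, assuming a git version that supports that variable.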

For each commit from the output of git whatchanged above, invoke an interactive rebase starting just before the commit that added the file:

    git rebase -i decafbad001badc0da0000~1

My custom GIT_EDITOR (that perl one-liner) changes pick to edit and we are dropped to a shell to make changes to the new file. Another simple header-inserter script looks for a known unique pattern in the header that I'm trying to insert (only in known file types (*.[chS] for me)). If it's not there, it inserts it, and git add's the file. This naive technique has no knowledge of which files were actually added during the present commit, but it ends up doing the right thing and being idempotent (safe to run multiple times against the same file), and is not where this whole process is bottlenecked anyways.
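The idempotent behaviour described above can be sketched roughly as follows. This is not the actual header-inserter script: the marker text, the variable name and the function name are placeholders, and a real license header would span several lines.

```shell
# Hypothetical marker; in practice this would be a unique line of the license header.
LICENSE_MARKER='Please put me at the top of each file.'

add_header_if_missing() {
    file=$1
    # grep -F matches the marker as a fixed string; -q only sets the exit status.
    if grep -qF "$LICENSE_MARKER" "$file"; then
        return 0    # header already present, so a second run changes nothing
    fi
    tmp_file="$file.tmp"
    # Build header + old contents in a temp file, then write back in place
    # so the original file keeps its permissions.
    { printf '%s\n' "$LICENSE_MARKER"; cat "$file"; } > "$tmp_file" &&
        cat "$tmp_file" > "$file" &&
        rm -f "$tmp_file"
}
```

During the edit stop, each changed file would then be staged with `git add` before amending the commit.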

At this point we're happy that we've updated the current commit, and invoke:

    git commit --amend
    git rebase --continue

The rebase --continue is the expensive part. Since we invoke a git rebase -i once for every revision in the output of whatchanged, that's a lot of rebasing. Almost all of the time during which this script runs is spent watching the "Rebasing (2345/2733)" counter increment.

It's also not just slow. There are periodically conflicts that must be addressed. This can happen in at least these cases (but likely more): (1) when a "new" file is actually a copy of an existing file with some changes made to its very first lines (e.g., #include statements). This is a genuine conflict but can be resolved automatically in most cases (yep, have a script that deals with that). (2) when a file is deleted. This is trivially resolvable by just confirming that we want to delete it with git rm. (3) there are some places where it seems like diff just behaves badly, e.g., where the change is only the addition of a blank line. Other more legitimate conflicts require manual intervention but on the whole they are not the biggest bottleneck. The biggest bottleneck is absolutely just sitting there staring at "Rebasing (xxxx/yyyy)".

Right now the individual rebases are initiated from newer commits to older commits, i.e., starting from the top of the output of git whatchanged. This means that the very first rebase affects yesterday's commits, and that eventually we'll be rebasing commits from 3 years ago. Going from "newer" to "older" seems counter-intuitive, but so far I'm not convinced that it matters unless we change more than one pick to an edit when invoking the rebase. I am afraid to do this because conflicts do arrive, and I don't want to deal with a tidal wave of conflict ripples from trying to rebase everything in one go. Maybe somebody knows a way to avoid that? I haven't been able to come up with one.
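For illustration only, changing more than one pick to an edit in a single rebase could look something like the sketch below. It assumes a file with one full commit id per line (for example, extracted from the git whatchanged output) and marks every matching todo entry. The function name is hypothetical, and this does nothing to avoid the conflict ripples described above; it only shows the mechanics.

```shell
# Hypothetical helper: mark every todo entry whose commit is listed in sha_list.
mark_listed_commits_as_edit() {
    sha_list=$1    # file with one full commit id per line
    todo_file=$2   # rebase todo file handed over by git
    tmp_file="$todo_file.tmp"
    awk '
        # First file: remember the full commit ids.
        FILENAME == ARGV[1] { if ($0 != "") wanted[$0] = 1; next }
        # Second file: todo entries carry abbreviated ids, so compare by prefix.
        $1 == "pick" {
            for (sha in wanted) {
                if (index(sha, $2) == 1) { sub(/^pick/, "edit"); break }
            }
        }
        { print }
    ' "$sha_list" "$todo_file" > "$tmp_file" &&
        mv "$tmp_file" "$todo_file"
}
```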

I started looking at the internal workings of git objects! It does seem like there should be a much more efficient way to walk the object graph and just make the changes that I want to make.

Please note that this repository came from an SVN repository where we effectively made no use of tags or branches (I already git filter-branched them away), so we do have the convenience of a straight-line history. No git branches or merges.

I'm sure I've left out some critical information, but this post already seems excessively long. I will do my best to provide more information as requested. In the end I may need to just publish my various scripts, which is a possibility. It is my objective to figure out how to rewrite history thusly in a git repository; not to debate other viable methods of licensing and code release.

Thanks!

Update 2012-06-17: Blog post with all the gory details.
Solution

Using

    git filter-branch -f --tree-filter 'if [ -f README ]; then echo "---FOOTER---" >> README; fi' HEAD

would essentially add a footer line to the README file, and the history would look like it has been there since file creation. I'm not sure if it will be efficient enough for you, but it is the correct way to do it.
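One detail worth checking in any tree filter is its exit status: filter-branch aborts when a filter command fails, so a filter of the form `test && action` stops the rewrite at the first commit where the test is false. A minimal sketch of a filter body that always succeeds when the file is simply absent follows; the function name is made up for illustration.

```shell
# Hypothetical filter body: append a footer only when the target file exists.
append_footer_if_present() {
    target=$1
    footer=$2
    if [ -f "$target" ]; then
        printf '%s\n' "$footer" >> "$target"
    fi
    # Report success even when the file is absent, so a commit without
    # the file does not abort the whole rewrite.
    return 0
}
```

Using plain `[ ... ]` instead of `[[ ... ]]` also keeps the filter working when the shell that runs it is not bash.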

Craft a custom script and you'll probably end up with a good project history. Doing too much "magic" (rebase, perl, scripted editors, etc.) may end up losing or changing project history in unexpected ways.

jon (the OP) used this basic pattern to achieve the goal with significant simplification and speedup:

    git filter-branch -d /dev/shm/git --tree-filter \
        'perl /path/to/find-add-license.pl' --prune-empty HEAD

A few performance-critical observations:

Using the -d <directory> parameter pointing to a ramdisk directory (like /dev/shm/foo) will improve the speed significantly.
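A small sketch of picking that directory, assuming /dev/shm is a tmpfs mount (true on most Linux systems, but an assumption here) and falling back to the ordinary temp directory otherwise. The function name and the `git-filter-` prefix are illustrative.

```shell
# Hypothetical helper: print a path for filter-branch's -d option.
choose_filter_tmpdir() {
    if [ -d /dev/shm ] && [ -w /dev/shm ]; then
        base=/dev/shm             # RAM-backed on typical Linux setups
    else
        base=${TMPDIR:-/tmp}      # fallback: may be disk-backed and slower
    fi
    # Only the path is printed; the directory is deliberately not created,
    # leaving that to filter-branch itself.
    printf '%s/git-filter-%s\n' "$base" "$$"
}
```

It could then be used as `git filter-branch -d "$(choose_filter_tmpdir)" --tree-filter '...' HEAD`.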

Do all changes from a single script, using its built-in language features. The forks incurred by small utilities (like find) slow the process down many times over. Avoid this:

    git filter-branch -d /dev/shm/git --tree-filter \
        'find . -name "*.[chS]" -exec perl /path/to/just-add-license.pl \{\} \;' \
        --prune-empty HEAD

This is a sanitized version of the perl script the OP used:

    #!/usr/bin/perl -w
    use File::Slurp;
    use File::Find;

    my @dirs = qw(aDir anotherDir nested/DIR);
    my $header = "Please put me at the top of each file.";

    foreach my $dir (@dirs) {
        if (-d $dir) {
            find(\&Wanted, $dir);
        }
    }

    sub Wanted {
        /\.c$|\.h$|\.S$/ or return;   # *.[chS]
        my $file = $_;
        my $contents = read_file($file);
        $contents =~ s/\r\n?/\n/g;    # convert DOS or old-Mac line endings to Unix
        unless ($contents =~ /Please put me at the top of each file\./) {
            write_file($file, {atomic => 1}, $header, $contents);
        }
    }
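After a rewrite like this, it may be worth spot-checking a checked-out revision for files that still lack the header. The following is a sketch of such a check, not part of the OP's tooling; the function name is hypothetical and the marker is passed in as an argument. It forks grep once per file, which is fine for a one-off check but is exactly what should be avoided inside the tree filter itself.

```shell
# Hypothetical check: list *.c, *.h and *.S files under root that lack the marker.
# File names containing newlines are not handled.
list_files_missing_header() {
    root=$1
    marker=$2
    find "$root" -type f \( -name '*.c' -o -name '*.h' -o -name '*.S' \) |
    while IFS= read -r file; do
        # grep -qF: fixed-string match, exit status only.
        if ! grep -qF "$marker" "$file"; then
            printf '%s\n' "$file"
        fi
    done
}
```

An empty result for `list_files_missing_header . 'Please put me at the top of each file.'` would suggest every matching file in that revision carries the header.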
