How to get size for each file in entire git history?

Problem description

I'd like to prune large files from my git repository. However, I'd like to be specific about it, so I would like to see all file sizes across the entire history of the repository.

I've created the following bash script, but it seems quite inefficient and may be missing files that have been deleted somewhere in history:

git log --pretty=tformat:%H | while read hash; do
   git show --stat --name-only $hash | grep -P '^(?:(?!commit|Author:|Date:|Merge:|   ).)*$' | while read filename; do
      if [ ! -z "$filename" ]; then
          git show "$hash:$filename" | wc -c | while read filesize; do
             if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
                printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
             fi
          done
      fi
   done
done

Any suggestions on a better way to go about it?

Solution

You are most of the way there, really.

git log --pretty=tformat:%H

This should just be git rev-list <start-points>, e.g., git rev-list HEAD or git rev-list --all. You may want to add --topo-order --reverse for reasons we'll reach in a moment.
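As a minimal sketch of that first step, the function below lists every reachable commit with parents before children. The function name commits_oldest_first is made up for illustration; the git rev-list options are the ones named above.

```shell
# Hypothetical helper (the name is an assumption, not part of Git):
# print every commit reachable from any ref, parents before children.
commits_oldest_first() {
    git rev-list --all --topo-order --reverse
}
```

It can then feed the loop directly, e.g. commits_oldest_first | while read -r commithash; do ...; done.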

 | while read hash; do
   git show --stat --name-only $hash

Instead of git show --stat, you probably just want to use git ls-tree on the hash. Using a recursive git ls-tree you will find every blob within the given commit (plus every tree, if you add -t), along with its corresponding path name.

The trees are probably not interesting, so we might drop down to blobs. Note, by the way, that git ls-tree will encode some problematic file names unless you use -z (but this makes it harder to read the items; bash can do it, plain sh can't).

 | grep -P '^(?:(?!commit|Author:|Date:|Merge:|   ).)*$' | while read filename; do

Using git ls-tree we can replace this with:

git ls-tree -r "$hash" | while read -r mode type objhash path; do

and then we'll skip anything whose type is not blob:

[ "$type" = blob ] || continue
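To make the parsing step concrete, here is a small sketch of a filter that reads git ls-tree -r style lines on standard input and keeps only blobs. The name blobs_only is hypothetical. In a plain recursive listing the entries it drops are mostly submodule links, which show up with type commit.

```shell
# Hypothetical filter (the name is illustrative): read `git ls-tree -r`
# style lines on stdin and print "<objhash> <path>" for blob entries only.
blobs_only() {
    # -r keeps backslashes intact; the last variable receives the rest of
    # the line, so paths containing spaces survive.
    while read -r mode type objhash path; do
        [ "$type" = blob ] || continue   # drops e.g. submodule "commit" entries
        printf '%s %s\n' "$objhash" "$path"
    done
}
```

Typical use would be git ls-tree -r "$hash" | blobs_only.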

  if [ ! -z "$filename" ]; then

We won't need this at all.

      git show "$hash:$filename" | wc -c | while read filesize; do
         if [ $(echo "$filesize > 100000" | bc) -eq 1 ]; then
            printf "%-40s %11s %s\n" "$hash" "$filesize" "$filename"
         fi

It's not clear to me why you have a while read filesize loop, nor the complex tests. In any case the easy way to get the size of the blob object is with git cat-file -s $objhash, and it's easy to test [ $blobsize -gt 100000 ] for instance:

    blobsize=$(git cat-file -s "$objhash")
    if [ "$blobsize" -gt 100000 ]; then
       echo "$hash contains $path size $blobsize"
    fi
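Calling git cat-file -s once per blob starts one process per file per commit, which adds up on a long history. A possible alternative, sketched here as an assumption rather than as part of the original approach, is git cat-file --batch-check, which reads object names on standard input and by default prints one "<hash> <type> <size>" line for each. The wrapper name object_sizes is made up.

```shell
# Hypothetical wrapper (the name is an assumption): read object hashes on
# stdin, print "<hash> <type> <size>" per object using a single git process.
object_sizes() {
    git cat-file --batch-check
}
```

For example, git ls-tree -r "$hash" | awk '{ print $3 }' | object_sizes would size every entry of one commit in a single pass.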

However, by giving up git show in favor of git ls-tree -r, we see every copy of each file in every commit, rather than just seeing it once, in the first commit in which it appears. For instance, if commit f00f1e adds big file bigfile and it persists in commit baafba6 unchanged, we'll see it both times. Using git show --stat runs a variant of git diff to compare each commit against its parent(s), so that we omit the file if we have seen it before.

The slight defect (or maybe not-defect) is that we "re-see" a file if it comes back. For instance if that big file is removed in the third commit and restored in the fourth, we'll see it twice.

This is where we may want --topo-order --reverse. If we use this, we'll get all parent commits before their children. We can then save each diagnosed object hash, and suppress a repeat diagnostic. Here a nice programming language that has associative arrays (hash tables) would be handy, but we can do this in plain bash with a file or directory that contains previously-displayed object hashes:

#! /bin/sh

# get temporary file to hold viewed object hashes
TF=$(mktemp)
trap 'rm -f "$TF"' 0 1 2 3 15

BIG=100000  # files up to (and including?) this size are not-big

git rev-list --all --topo-order --reverse |
while read -r commithash; do
    git ls-tree -r "$commithash" |
    while read -r mode type objhash path; do
        [ "$type" = blob ] || continue      # only look at files
        blobsize=$(git cat-file -s "$objhash")
        [ "$blobsize" -lt "$BIG" ] && continue # or -le
        # found a big file - have we seen it yet?
        grep -q "$objhash" "$TF" && continue
        echo "$blobsize byte file added at commit $commithash as $path"
        echo "$objhash" >> "$TF" # don't print again under any path name
    done
done

Note that since we now remember large files by their hash ID, we won't re-announce them even if they re-appear under another name (e.g., get git mved, or are removed and then re-appear under the same or another name).
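If awk is available, its associative arrays can stand in for the temporary file and the grep. This is a sketch of that swapped-in technique, not the method used in the script above. The filter name first_seen_only is hypothetical, and it assumes each input line starts with the object hash.

```shell
# Hypothetical filter (the name is an assumption): pass through only the
# first line seen for each value of the first field (the object hash).
first_seen_only() {
    # seen[] is an awk associative array keyed by hash; the expression is
    # true only the first time a given key appears.
    awk '!seen[$1]++'
}
```

Fed with lines such as "$objhash $commithash $path" in parents-first order, it reports each big blob once, at the first commit that contains it.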

If you prefer the diff-invoking method that git show uses, we can use that instead of our hash-saving temporary file, but still avoid the clumsy grepping away of commit messages, by using the appropriate plumbing command, which is git diff-tree. It's also probably still wise to use --topo-order (just as a general rule), although it's no longer required. So this gives:

BIG=100000 # just as before

git rev-list --all --topo-order | while read -r commithash; do
    git diff-tree -r --name-only --diff-filter=AMT "$commithash" |
        tail -n +2 | while read -r path; do
            objsize=$(git cat-file -s "$commithash:$path")
            [ "$objsize" -lt "$BIG" ] && continue
            echo "$objsize byte file added at commit $commithash as $path"
        done
done

git diff-tree needs -r to work recursively (same as git ls-tree), needs --name-only to print only file names, and needs --diff-filter=AMT to print only the names of files added, modified, or type-changed (from symlink to file or vice versa). Obnoxiously, git diff-tree prints the commit ID again as the first line. We can suppress the ID with --no-commit-id but then we get a blank line, so we might as well just use tail -n +2 to skip the first line.
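One case worth checking in your own repository: as far as I know, git diff-tree prints nothing for a commit that has no parent unless --root is given, so files added in the very first commit could be missed. The sketch below adds --root, uses --no-commit-id, and drops empty lines defensively so it does not depend on whether a blank line appears. The function name changed_paths is hypothetical.

```shell
# Hypothetical helper (the name is an assumption).
# usage: changed_paths <commit>
# Prints paths added, modified or type-changed by the commit, including
# everything in a parentless (root) commit.
changed_paths() {
    git diff-tree --root --no-commit-id -r --name-only --diff-filter=AMT "$1" |
        sed '/^$/d'   # defensive: discard any empty line
}
```

In the loop above it would replace the git diff-tree ... | tail -n +2 pair: changed_paths "$commithash" | while read -r path; do ...; done.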

The rest of the script is the same as yours, except that we get the object's size the easy way, using git cat-file -s, and test it directly with the [ / test program.

Note that with merge commits, git diff-tree (like git show) uses a combined diff, showing only files that, in the merge result, don't match either parent. This should be OK since if file huge is 4GB in the merge result but is identical to file huge that was 4GB in one of the two merged commits, we'll see huge when it's added to that commit, instead of seeing it in the merge itself.

(If that's not desirable, you can add -m to the git diff-tree command. However, then you'll need to drop the tail -n +2 and put in the --no-commit-id, which behaves differently under -m. This particular behavior in Git is somewhat annoying, although it makes sense with the default output format, which is similar to git log --raw.)

(NB: code above is not tested - spotted and fixed $hash vs $commithash on last re-read.)
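For comparison, here is a sketch of a different technique that avoids the per-commit loop entirely: git rev-list --objects lists every reachable object once, with a path, and git cat-file --batch-check can report type and size for the whole list. This is not the approach described above. The function name big_blobs and its default threshold are assumptions, and the path shown is simply the one git rev-list reports for the object.

```shell
# Hypothetical helper (name and default are assumptions).
# usage: big_blobs [threshold-in-bytes]
# Prints "<size> <hash> <path>" for each blob in history at or above the
# threshold, smallest first. Each object appears once.
big_blobs() {
    threshold=${1:-100000}
    git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v big="$threshold" '$1 == "blob" && $3 + 0 >= big + 0 {
        size = $3; hash = $2
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")   # strip type, hash, size; keep the path
        printf "%11d %s %s\n", size, hash, $0
    }' |
    sort -n
}
```

Running big_blobs 1000000 inside a repository would list blobs of roughly a megabyte or more, including ones that were deleted from the current tree but are still reachable from some ref.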
