Accidentally committed sensitive information - GitLab


Problem description

I accidentally committed a file with sensitive data. I need to update that file by removing the sensitive data and ensure the older version doesn't show up in the history.

I understand that those who have the repo cloned locally will still have access to it. But once they pull the latest, can it be setup in a way that they will not see the sensitive data moving forward or will not be able to see it in the logs?

Solution

While GitLab is not generally as public as GitHub, the general rules about data apply here: if you've given sensitive / secret data to someone who cannot be trusted, your secret is already out and you should stop depending on it.

That means the key question is not—or at least, not yet—"how do I convince GitLab to forget my secrets" but rather "do I completely, totally trust both the GitLab server(s) and everyone else that has had access to those server(s) all this time?" If the answer is "no" you must stop depending on this secret anyway.

That said, here are rules about how Git itself stores the data. Assuming your GitLab server(s) is/are using only Git (and not some additional things built atop them that may add yet more ways to access the data that provide even more ways for your sensitive / secret data to leak), all you have to do is convince the GitLab servers to do the same thing you would do in your own Git.

Git's underlying storage model is that a repository is a collection of what Git calls objects. Each object has a unique hash ID, and is one of four types: blob, tree, commit and annotated tag. A blob is, roughly, file data. If the sensitive / secret data are inside a file, they are actually inside a blob object. A tree pairs up each file's name with its blob hash ID (well, more than pairs, but let's use that for now [1]), so if the file's name is the sensitive / secret data, your secret is actually inside a tree object. A commit object contains your name, email address, time stamp, log message, and the hash ID of some previous or parent commit(s), along with the hash ID of the tree that holds the files that make up the snapshot that is that commit. An annotated tag object holds much the same as a commit except that instead of a tree object, it usually has the hash ID of a commit; this is where one usually stores a PGP signature marking some particular commit as "blessed" and, say, called version 2.3.4 or whatever.
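To make the four object types concrete, here is a small sketch that builds a throwaway repository and asks Git what type each object is. The file name secret.txt, its contents, the tag name v1.0 and the demo identity are all made up for illustration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # pin the branch name used in the text
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m "add file"
git tag -a -m "release" v1.0              # annotated tag, so it gets its own object

commit_type=$(git cat-file -t HEAD)
tree_type=$(git cat-file -t 'HEAD^{tree}')
blob_type=$(git cat-file -t HEAD:secret.txt)
tag_type=$(git cat-file -t v1.0)
echo "$commit_type $tree_type $blob_type $tag_type"

# The tree is what pairs the file name with the blob hash ID.
git ls-tree HEAD
```

The last command lists one line per tree entry, which is where a secret file name (as opposed to secret file content) would live.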

Assuming your secrets are in one particular file, whose name itself is not secret, your goal at this point is to cause your Git to stop using the blob that holds that particular file's data. To do so, you must cause the object itself to become unreferenced, and then use git gc to make Git physically remove the unreferenced object. At this point, a long aside into reachability in general is useful, but I'll outsource it to Think Like (a) Git. Let's just say here that in general, right after you've accidentally committed some secret file, the way that Git finds the commit object is using a branch name:

... <-F <-G <-H   <--master

The name master contains the hash ID of commit H. Commit H contains the hash ID of its parent commit, commit G, so for Git to find commit G, it starts by reading the name master (which produces hash ID H) and then reading the commit object from the database (which produces one tree object and one parent commit hash, G, along with the log message and your name and email address and so on), throws out all but the hash of G, and then reads the actual commit object G from the database. If you have asked Git to get some particular file—or more precisely, that file's content—from commit G, it then uses G's tree to find the hash ID of the blob containing that file, then gets the blob object from the database, and now Git has the content.
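The same walk can be done by hand. This is only a sketch of the lookup described above, in a scratch repository with an invented file.txt; Git does all of this internally.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo one > file.txt
git add file.txt
git commit -q -m G
echo two > file.txt
git commit -q -a -m H

H=$(git rev-parse master)        # the name master yields hash ID H
git cat-file -p "$H"             # shows H's tree, parent, author and message
# Throw out everything except the parent hash, which is G.
G=$(git cat-file -p "$H" | sed -n 's/^parent //p')
# G's tree maps the file name to a blob hash ID.
blob=$(git ls-tree "$G" file.txt | awk '{print $3}')
content=$(git cat-file -p "$blob")
echo "file.txt as of G: $content"
```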

So, suppose your secret data are in a blob attached to a tree attached to commit H, and those same data are not in any other file—so that no tree attached to any other commit will have the hash ID of that blob. Then, to make H itself unreferenced, just make the name master point to G instead of H:

git checkout master
git reset --hard HEAD~1

Now you have:

...--E--F--G   <-- master
            \
             H   [abandoned]

But while H has no obvious name holding its hash ID, we're not yet done: git gc won't—at least not yet—remove H, and here's where things start to get complicated.
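A quick way to convince yourself of this is to try it in a scratch repository. In the sketch below (file names and contents are invented), no branch contains H after the reset, yet both H and the secret blob are still stored, even after an ordinary git gc.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m H
H=$(git rev-parse HEAD)
blob=$(git rev-parse HEAD:secret.txt)

git reset -q --hard HEAD~1                # master now names G
git gc -q                                 # an ordinary gc, grace periods intact

branches=$(git branch --contains "$H")    # empty: no branch reaches H
survived=no
git cat-file -e "$H" && git cat-file -e "$blob" && survived=yes
echo "branches containing H: '$branches'; H and blob still stored: $survived"
```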

If there are valuable files in H, we can push H aside, using git commit --amend, to make a new commit I whose parent is G instead of H, and have master point to I:

... edit files, git add, git commit --amend ...

giving:

             H   [abandoned]
            /
...--E--F--G--I   <-- master
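As a sketch of the amend route, assuming the secret sits in one line of an otherwise valuable file (app.conf and its contents are invented here):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
G=$(git rev-parse HEAD)

printf 'user=demo\npassword=hunter2\n' > app.conf   # hypothetical secret
git add app.conf
git commit -q -m "add config"
H=$(git rev-parse HEAD)

# Drop the secret from the file, then replace H with a new commit I.
printf 'user=demo\n' > app.conf
git add app.conf
git commit -q --amend -m "add config"
I=$(git rev-parse HEAD)
parent_of_I=$(git rev-parse 'HEAD~1')
echo "H=$H I=$I parent of I=$parent_of_I"
```

I gets G as its parent and master moves to I, but H is only set aside: it is still in the object database at this point.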


[1] Technically, each tree entry has:

  • the entry's mode, a text string like 100755 or 100644. The string is 40000 if the entry is for a sub-tree.
  • a string of bytes holding the file's name, generally in UTF-8 encoding
  • the hash ID that goes with the entry

(The mode and name are separated by a space, and the name is terminated by an ASCII NUL, while the hash ID is encoded in 20 binary bytes for SHA-1. Repositories that use SHA-256 store 32 binary bytes there instead, with the rest of the entry laid out the same way.) Hence for sub-directories, the tree just lists sub-trees, and for regular files there are two possible mode values plus a hash. For symlinks (mode 120000), the hash ID is still that of a blob, but the blob's content is the target of the symbolic link; and for gitlinks for submodules (mode 160000), the hash ID is that of the commit Git should git checkout in the submodule.
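The entry layout can be checked by arithmetic. This sketch commits a single file named a.txt (an invented name) and compares the size Git reports for the tree object with mode + space + name + NUL + binary hash. The hash width is derived from the repository itself rather than assumed.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo hello > a.txt
git add a.txt
git commit -q -m "one file"

git ls-tree HEAD                        # mode, type, hash ID, name

h=$(git rev-parse HEAD)
hash_bytes=$(( ${#h} / 2 ))             # hex digits / 2, i.e. 20 for SHA-1
# "100644" (6) + space (1) + "a.txt" (5) + NUL (1) + binary hash
expected=$(( 6 + 1 + 5 + 1 + hash_bytes ))
actual=$(git cat-file -s 'HEAD^{tree}')
echo "expected=$expected actual=$actual"
```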


The main complication is reflogs

The part of Git that does remember H for you, even after you git reset it away, is what Git calls reflogs. A reflog remembers the previous values of a reference. That is, the branch name master points to H before we git reset it. After we use git reset --hard or git commit --amend to discard commit H, it points to G or I instead. But it used to point to H, so H's hash ID is in the reflog for the name master.

The @{1} or @{yesterday} syntax is how you tell Git to look up these reflog values. Writing master@{1} tells your Git: look in my master reflog, and get me the immediately-previous value of master. The fact that this entry exists will make your Git retain commit H which will make your Git retain the blob containing the secret.

There are in fact at least two reflogs containing the hash ID of commit H: one for master, in master@{1}, and one for HEAD itself. So if you are to convince your Git to really discard commit H, and hence discard the tree for H, and hence discard any blobs unique to the tree for H, you must make these reflog entries go away.
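Both reflogs are easy to inspect. A sketch in a scratch repository (invented file names), showing that master@{1} and HEAD@{1} still name H after the reset:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m H
H=$(git rev-parse HEAD)
git reset -q --hard HEAD~1

git reflog show master                      # history of values of the name master
previous=$(git rev-parse 'master@{1}')      # immediately-previous value of master
head_previous=$(git rev-parse 'HEAD@{1}')   # HEAD keeps its own reflog
echo "master@{1}=$previous HEAD@{1}=$head_previous"
```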

Normally, they go away on their own, generally after about 30 days. This happens because each reflog entry has a time-stamp as well, and git reflog expire will expire (and remove) old reflog entries based on this time-stamp, compared with the current time on your computer. The top-level git gc command runs git reflog expire for you, and sets it up to expire unreachable commits [2] in 30 days by default. (Reachable commits get 90 days by default.) So on your own Git, you would need to run:

git reflog expire --expire-unreachable=now --all

to tell your Git: Find all unreachable commits like H and expire their reflog entries now.
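To see the effect, count how many reflog entries name H before and after the expiry. This is a sketch in a scratch repository with invented file names; --format=%H simply prints the commit hash of each reflog entry.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m H
H=$(git rev-parse HEAD)
git reset -q --hard HEAD~1

# grep -c exits non-zero on a count of 0, hence the "|| true".
before=$(git reflog show --format=%H master | grep -c "$H" || true)
git reflog expire --expire-unreachable=now --all
after=$(git reflog show --format=%H master | grep -c "$H" || true)
head_after=$(git reflog show --format=%H HEAD | grep -c "$H" || true)
echo "entries naming H: before=$before after=$after (HEAD: $head_after)"
```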


[2] Technically, it's unreachable from the current value of the reference. That is, Git is not going to test a global reachability here, but rather do a somewhat simpler test: does this reflog entry point to a commit that is an ancestor of the commit to which the reference itself points right now?


The secondary complication is the object-prune grace time

Even after expiring the reflog entries from both HEAD and the branch name, you'll find that your own git gc does not immediately discard the blob object. The reason is that all Git objects have a grace period during which git gc won't prune them away. The default grace period is 14 days (the gc.pruneExpire setting). This gives all Git commands some time during which they can create objects without worrying about them, as long as they finish all their work within that 14-day period by linking all those objects up into a commit or tag object or whatever, and making an appropriate reference name (such as a branch or tag name) record the hash ID of that object.

To make the blob you accidentally committed with H go away, then, you not only need to expire the unreachable reflog entries, but also tell Git to prune objects even if they're zero days old:

git prune --expire=now

This prune step is the part of git gc that actually removes the object, so by running git prune, you remove the need to run git gc. (git gc also runs the reflog expire and so on, but coordinates everything to make sure Git has these grace periods. Since we're bypassing all the grace periods, we just bypass git gc as well.)

Make sure no other Git commands are running when you do this, since they may be creating objects that they expect to persist for 14 days while they get their work done.
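Putting the steps so far together for the loose-object case, as a sketch in a scratch repository (file names and contents invented). Run it only where nothing else is using the repository, for the reason just given.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m H
H=$(git rev-parse HEAD)
blob=$(git rev-parse HEAD:secret.txt)

git reset -q --hard HEAD~1                          # 1. no name points to H
git reflog expire --expire-unreachable=now --all    # 2. no reflog entry points to H
git prune --expire=now                              # 3. drop unreachable loose objects

if git cat-file -e "$blob" 2>/dev/null; then blob_state=present; else blob_state=gone; fi
if git cat-file -e "$H" 2>/dev/null; then commit_state=present; else commit_state=gone; fi
echo "secret blob: $blob_state, commit H: $commit_state"
```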

The last complication is pack files

If your secret is stored in what Git calls a loose object, the above steps suffice: the object will be completely gone, and:

git cat-file -t <hash-ID>

will no longer find the object at all. It's no longer available anywhere in this Git repository.
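One caveat when checking: given a full-length hash ID, git rev-parse echoes it back without looking the object up, so it is not a reliable existence test on its own. git cat-file -e does check. The sketch below fabricates a hash ID that is not in the repository (by rotating the hex digits of a real one, purely as an illustration) and compares the two commands.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo hello > a.txt
git add a.txt
git commit -q -m "one file"

real=$(git rev-parse HEAD)
# Rotate every hex digit to get a same-length hash ID that does not exist here.
fake=$(printf '%s' "$real" | tr '0-9a-f' '1-9a-f0')

echoed=$(git rev-parse "$fake")     # succeeds: a full hash ID is not looked up
if git cat-file -e "$fake" 2>/dev/null; then exists=yes; else exists=no; fi
echo "rev-parse echoed: $echoed; object exists: $exists"
```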

But not all objects are loose. Eventually, to save space, Git packs these loose objects into pack files. Objects stored inside pack files get compressed against other objects in the same pack file. [3] In this case, if your secret data have become packed, it's possible to retrieve them from the pack file.

This usually doesn't happen quickly so it's rare to have a just-committed secret wind up in a pack file. But if it has happened, the only way to clean it up is to make Git re-pack all the existing pack files. That is, you would have Git explode the packs into their constituent loose objects, then toss the unwanted object, then build a new (usually single) pack file—or use a process that has that effect, at least. The Git command to rebuild the packs is git repack and it has a lot of options. I'm not going to go into any more detail here as I'm out of time.
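One combination that should have that effect is git repack -a -d, which writes a single new pack containing only what is still referenced and then deletes the packs it made redundant. This is a sketch under that assumption, not the only way to do it; the scratch repository and file names are invented, and git gc is run up front just to force the secret into a pack.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m H
blob=$(git rev-parse HEAD:secret.txt)

git gc -q                                 # everything, secret included, is now packed
git reset -q --hard HEAD~1
git reflog expire --expire-unreachable=now --all
git repack -a -d -q                       # new pack of referenced objects; old packs removed
git prune --expire=now                    # sweep up any loose leftovers

if git cat-file -e "$blob" 2>/dev/null; then blob_state=present; else blob_state=gone; fi
echo "secret blob: $blob_state"
```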


[3] In thin packs, objects may be compressed against other objects in the repository that are not in the pack file, but thin packs are used only for fetch and push operations, after which they're "fattened up" by adding the missing bases back.


Servers often don't have reflogs

To deal with all of this, you need to be able to log into your GitLab server(s), as none of these maintenance Git commands (nor the BFG, see below) can be invoked via fetch or push. In particular, while you can use git push -f from your client to make the name master on the server no longer point to commit H, you cannot invoke git prune to make a loose object go away.
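The limitation can be reproduced locally by letting a bare repository stand in for the server. In this sketch (paths and file names invented; a real GitLab server may keep additional references of its own), the forced push moves the server's master back to G, but commit H is still sitting in the server's object store.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare "$work/server.git"     # stand-in for the server
git init -q "$work/client"
cd "$work/client"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"
git remote add origin "$work/server.git"

echo ok > notes.txt
git add notes.txt
git commit -q -m G
G=$(git rev-parse HEAD)
echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m H
H=$(git rev-parse HEAD)
git push -q origin master                 # the accident: H reaches the server

git reset -q --hard HEAD~1
git push -q -f origin master              # server's master now names G

server_tip=$(git --git-dir="$work/server.git" rev-parse refs/heads/master)
if git --git-dir="$work/server.git" cat-file -e "$H" 2>/dev/null; then
  server_has_H=yes
else
  server_has_H=no
fi
echo "server master=$server_tip, server still stores H: $server_has_H"
```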

If and when you do log into the server, you can check whether reflogs are enabled for your repository there. If not, there's no need to do any reflog expiry. You can also see whether your object is loose or packed by looking into the .git/objects directory (in a bare repository, which is what servers normally use, that is just the objects directory inside the repository itself). If your blob hash ID is, say, 0123456789... it will live in a file named .git/objects/01/23456789.... Once it's unreferenced and pruned, the file will be gone and you will be done.
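The path of a loose object can be derived from its hash ID with plain shell string operations. A sketch in a non-bare scratch repository (so the objects live under .git/objects; secret.txt is an invented name):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.invalid"

echo "password=hunter2" > secret.txt      # hypothetical secret
git add secret.txt
git commit -q -m "add file"

blob=$(git rev-parse HEAD:secret.txt)
rest=${blob#??}                 # hash ID without its first two characters
dir=${blob%"$rest"}             # just the first two characters
path=".git/objects/$dir/$rest"
if [ -f "$path" ]; then where=loose; else where="packed or missing"; fi
echo "$path: $where"

git count-objects -v            # summary of loose ("count") vs packed ("in-pack") objects
```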

Using The BFG repo cleaner

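You can avoid a lot of complications by using the BFG repo cleaner, which rewrites every commit that contains the unwanted data. Its documentation then has you run git reflog expire --expire=now --all followed by git gc --prune=now --aggressive, which bypasses the grace periods and repacks, so that also takes care of any pack file issues. Like the other method, the cleanup has to happen in the server's copy of the repository, and it has its own quirks (see the linked question and answers).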
