Git: cloning with shared local storage (using hard links)
I'd like to make it easy for a large number of devs to repeatedly clone a very large and remote git repo. Some sort of local per-user 'caching' is necessary. There are obviously lots of ways to do this, I'm just surprised that it seems as if the one way that would seem most natural to me does not exist in git.
Is there an industry standard practice on this?
Is there some git option that I'm just misunderstanding?

Ideal solution
```shell
# first clone - very slow.
git clone ssh://remote.repo/repo.git repo1

# subsequent clones - lightning fast
git clone --shared-with-hard-links repo1 ssh://remote.repo/repo.git repo2
```
In this imaginary solution, no `.git/objects/info/alternates` is created; objects are simply shared at clone time using hard links, like rsync's `--link-dest` option, or like git's clone when the repo is on the local filesystem.

The alternatives I see are all unattractive:
- I can do `git clone --reference repo1 ssh://remote.repo/repo.git repo2`, which relies on repo1 existing; if repo1 is deleted, then repo2 is fubared.
- I can do `git clone --dissociate --reference repo1 ssh://remote.repo/repo.git repo2`, but storage is not shared, so now I've used up twice the storage I want, and it's probably still relatively slow for that reason.
- There are various hacks of varying complexity that may need wrappers around cloning and pulling. The complexity is, compared to real programming, obviously trivial, but running your SCM under a bunch of wrappers is just a nuisance that should really be avoided:
  - Store a git 'cache' repo in a central location on each dev's PC and have a wrapper around clone that automatically fetches into the cache first and then runs `clone --reference <cache>`.
  - Remember every clone that is done; subsequent clones look for a pre-existing local clone, clone locally from that (creating hard links), and then fix up the remotes after that. Roughly, it goes something like this:
```shell
# find any existing clones... repo1
git clone /path/to/repo1 repo2
cd repo2
git remote rm origin
git remote add origin ssh://remote.repo/repo.git
git fetch

# Abandon any local changes made in the other workspace
gitdir=$PWD/.git
for ref in $(git --git-dir "$gitdir" for-each-ref refs/heads --format "%(refname)"); do
    refbase=$(basename "$ref")
    git --git-dir "$gitdir" update-ref "$ref" "refs/remotes/origin/$refbase"
done
```
But it all seems like a hack. Surely there's a better way?
Thanks,
Mort

Notes:
- We actually do have a LAN-local mirror. The repo is large enough that we need better than just that to achieve reasonable clone speeds.
- The repo is big. 11 min to clone over GigE and up to 40 min if the user is on Windows.
Update
The best thing that I can figure out to do is to have a cache in `/var/cache/git/<repo_name>.git` that is a `clone --mirror` of the central repo. New clones use the `--shared` option to both reduce space/time in the initial clone and to speed up subsequent `fetch`es. There is a wrapper script to clone a new workspace that does this:

```shell
git --git-dir /var/cache/git/<repo_name>.git remote update
git clone --shared /var/cache/git/<repo_name>.git
git remote set-url origin ssh://remote.repo/repo.git
```
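The three steps above can be wrapped in a small helper. This is only a sketch: the `cached_clone` name and the `CACHE_DIR`/`UPSTREAM` variables and their defaults are my own assumptions, not part of the original setup.

```shell
# Hypothetical wrapper around the cache-then-clone recipe above.
# CACHE_DIR and UPSTREAM defaults are illustrative assumptions.
cached_clone() {
    name=$1   # repository name, e.g. "repo"
    dest=$2   # workspace directory to create
    cache=${CACHE_DIR:-/var/cache/git}/$name.git
    upstream=${UPSTREAM:-ssh://remote.repo/$name.git}

    # First use: create the mirror cache; later runs just refresh it.
    if [ -d "$cache" ]; then
        git --git-dir "$cache" remote update --prune
    else
        git clone --quiet --mirror "$upstream" "$cache"
    fi

    # --shared records the cache in .git/objects/info/alternates
    # instead of copying objects, so this clone is nearly instant.
    git clone --quiet --shared "$cache" "$dest"

    # Point origin back at the real upstream for day-to-day fetch/push.
    git -C "$dest" remote set-url origin "$upstream"
}
```

Fetches in the workspace stay fast as long as the cache is refreshed first, because any object already reachable through the alternates link is never transferred again.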
I would have preferred something that relied on hard links because they are immune to issues if objects are somehow removed from the shared cache. But I guess that does not exist.
Solution

Git does hardlink by default when you clone a local repository. So, you can:
```shell
git clone /path/to/repo /path/to/clone
cd /path/to/clone
git remote add upstream http://example.com/path/to/repo/to/clone
git fetch upstream
```
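A quick way to confirm the hard-link sharing, a sketch using throwaway paths rather than anything from the answer, is to compare link counts on a loose object file:

```shell
# Demonstrate that a plain local-path clone hardlinks object files.
# All paths here are throwaway examples created on the fly.
set -e
tmp=$(mktemp -d)
git init --quiet "$tmp/repo"
git -C "$tmp/repo" -c user.email=you@example.com -c user.name=you \
    commit --quiet --allow-empty -m "initial commit"

# A local path (not a file:// URL, no --no-hardlinks) links instead of copying.
git clone --quiet "$tmp/repo" "$tmp/clone"

# A link count of 2 on a loose object means both repos share the same file.
obj=$(find "$tmp/repo/.git/objects" -type f | head -n 1)
stat -c %h "$obj"
```

Using a `file://` URL or `--no-hardlinks` disables this, which is why the retarget-the-remote recipe above starts from a plain path.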
But this has a number of disadvantages:
- The next `git gc` will break the hardlinks and eat your disk space.
- This will work only if `/path/to/repo` and `/path/to/clone` are on the same partition.
- You have to be careful with the tools you use on the result; e.g. an `rsync` without `-H` will copy each hardlinked file as an independent copy.

I think the `.git/objects/info/alternates` approach is much better in most cases.
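For comparison, here is what the alternates mechanism the answer recommends actually looks like on disk (throwaway paths again, not anything from the original setup):

```shell
# Inspect what `git clone --shared` records: a one-line alternates file
# pointing at the source object store, instead of hardlinked copies.
set -e
tmp=$(mktemp -d)
git init --quiet "$tmp/repo"
git -C "$tmp/repo" -c user.email=you@example.com -c user.name=you \
    commit --quiet --allow-empty -m "initial commit"

git clone --quiet --shared "$tmp/repo" "$tmp/clone"

# The clone borrows objects from the source repo's objects directory.
cat "$tmp/clone/.git/objects/info/alternates"
# Unlike hardlinks, this link survives `git gc` in the clone, but the
# clone breaks if the source repo ever removes objects it still needs.
```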