Measuring "closeness" in large source trees


Problem description

As part of a question I posed earlier about finding the best match between two sources, where one has an active git repo and the other has no git history, I wrote a Perl script to find the closest git commit.

I'm rewriting the script so that you don't have to guess which branch to use; instead, it will run through all branches, find the closest match, and report the best commit along with its branch. Unfortunately, I'm finding that the measurement I'm using may not be the best judge of "closeness."

Currently, I use diff -burN -x.git my_git_subtree my_src_subtree | wc -l to determine how close the code trees are. This works more or less, but I run into cases where entire folders are added or missing in one branch and likely do or don't exist in another.

Is there a better way to determine how close the sources are? I'm envisioning something that compares the directory structures, and possibly also how many lines differ. It could just be a matter of passing different parameters to diff, or maybe there is another tool out there that does something like that.

Solution

To improve on your measurement, why not try git diff --shortstat? The output looks like this:

 1 file changed, 1 insertion(+), 2 deletions(-)

You can play around with how to weight file changes, insertions, and deletions, depending on the results.
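As a minimal sketch of that weighting idea, the snippet below parses one line of --shortstat output and folds the three counts into a single score. The function names and the default weights are assumptions made for illustration, not part of git or of the original script; git leaves out the insertions or deletions part when it is zero and prints nothing when there are no differences, so missing parts are treated as 0.

```python
import re


def parse_shortstat(line):
    """Return (files_changed, insertions, deletions) from one --shortstat line.

    Any part that is absent from the line (or an empty line) counts as 0.
    """
    def grab(pattern):
        match = re.search(pattern, line)
        return int(match.group(1)) if match else 0

    files = grab(r"(\d+) files? changed")
    insertions = grab(r"(\d+) insertions?\(\+\)")
    deletions = grab(r"(\d+) deletions?\(-\)")
    return files, insertions, deletions


def closeness_score(line, file_weight=10, insert_weight=1, delete_weight=1):
    """Lower is closer. The weights are hypothetical starting points to tune."""
    files, insertions, deletions = parse_shortstat(line)
    return (files * file_weight
            + insertions * insert_weight
            + deletions * delete_weight)


print(closeness_score(" 1 file changed, 1 insertion(+), 2 deletions(-)"))  # prints 13
```

Raising file_weight penalizes commits that touch many files, which may help when whole folders differ between branches.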

Looking at your Perl, I think you're probably not going to be able to make assumptions about the ordering of "closeness" among commits. You may need to brute-force check every commit, or at least make that an option.

I'd also suggest that instead of looking for the closest, you keep a sorted list of (commit, "closeness") pairs and perhaps display the top few and review them by hand. As mentioned below, there is no silver bullet for determining whether two sets of code are close or not simply by looking at the number of changes. That said, number of changes can definitely help you narrow down the list you should review...
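One way to keep such a short list, sketched here under the assumption that each commit has already been given a numeric score where lower means closer. The commit ids and scores below are made up for illustration.

```python
import heapq


def top_candidates(scores, limit=5):
    """Return the `limit` closest (commit, score) pairs, closest first.

    `scores` is any iterable of (commit, score) pairs; lower score = closer.
    """
    return heapq.nsmallest(limit, scores, key=lambda pair: pair[1])


# Hypothetical commit ids and scores, only to show the shape of the data.
scores = [("a1b2c3d", 420), ("e4f5a6b", 13), ("c7d8e9f", 97), ("0a1b2c3", 13)]

for commit, score in top_candidates(scores, limit=3):
    print(f"{score:6d}  {commit}")
```

Ties (like the two commits scoring 13 here) are exactly the cases worth reviewing by hand.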

UPDATE: I should also mention another advantage of using git diff: you don't have to run a hard reset for each commit. Simply symlink the repository's .git/ directory into your unknown tree (the one without a git history) and use git reset [--mixed] <commit>. This updates the current head pointer and the index but leaves your source files unchanged (obviously, back up the unknown source tree before using this method).
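A rough sketch of how the reset-and-diff loop could be driven from a script, combined with the brute-force idea of visiting every commit. It assumes the unknown tree already contains the symlinked .git, and that this .git belongs to a throwaway clone, because git reset moves whatever branch is currently checked out. The helper names are hypothetical; the git subcommands and options used (-C, rev-list --all, reset --quiet --mixed, diff --shortstat) are standard.

```python
import subprocess


def git(tree, *args):
    """Run git inside `tree` and return its stdout as text."""
    result = subprocess.run(
        ["git", "-C", tree, *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def shortstat_per_commit(tree, run=git):
    """Yield (commit, shortstat line) for every commit reachable from any ref.

    `run` is injectable so the loop can be exercised without a real repo.
    """
    for commit in run(tree, "rev-list", "--all").split():
        # Move HEAD and the index to this commit; working files stay untouched.
        run(tree, "reset", "--quiet", "--mixed", commit)
        # Compare the untouched working files against the index.
        yield commit, run(tree, "diff", "--shortstat").strip()
```

Calling list(shortstat_per_commit("my_src_subtree")) would collect one line per commit, ready for scoring. Note that git diff only reports files tracked at each commit, so files that exist only in the unknown tree are not counted by this measure.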
