How do I determine the longest similar portion of several strings?
Question
As per the title, I'm trying to find a way to programmatically determine the longest portion of similarity between several strings.
For example:
- file:///home/gms8994/Music/t.A.T.u./
- file:///home/gms8994/Music/nina%20sky/
- file:///home/gms8994/Music/A%20Perfect%20Circle/
Ideally, I'd get back file:///home/gms8994/Music/, because that's the longest portion that's similar between all 3 strings.
Specifically, I'm looking for a Perl solution, but a solution in any language (or even pseudo-language) would suffice.
From the comments: yes, only at the beginning; but there is the possibility of having some other entry in the list, which would be ignored for this question.
Recommended answer
Edit: I'm sorry for the mistake. I overlooked that using a my variable inside countit(x, q{}) is a big mistake. The string is evaluated inside the Benchmark module, and @str was empty there. This solution is not as fast as I originally presented. See the correction below. Sorry again.
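The pitfall described in the edit note can be reproduced with a small sketch. It assumes only the core Benchmark module and uses timeit with a single iteration instead of countit so the run is deterministic; the variable names ($seen_by_string, $seen_by_coderef) and the sample data are made up for illustration. Benchmark compiles string code itself, outside the caller's lexical scope, so a my variable is not visible there, while a code reference is a closure and keeps access to it.

```perl
use strict;
use warnings;
use Benchmark qw(timeit);

# Lexical array: visible to closures defined here, but not to code that
# Benchmark compiles from a string.
my @str = ( 'file:///a/b/', 'file:///a/c/' );

# Package variables so that both forms can report what they observed.
our $seen_by_string  = -1;
our $seen_by_coderef = -1;

# String form: the text is eval'd by Benchmark in the caller's package,
# so @str here should resolve to the (empty) package variable @main::str.
timeit( 1, q{ $seen_by_string = scalar @str } );

# Code-reference form: the closure captures the lexical @str.
timeit( 1, sub { $seen_by_coderef = scalar @str } );

print "string form saw $seen_by_string element(s)\n";
print "code ref saw $seen_by_coderef element(s)\n";
```

This is why the corrected benchmark passes sub { ... } blocks rather than q{ ... } strings: a benchmark that runs against an empty list measures almost nothing and looks misleadingly fast.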
Perl can be fast:
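use strict;
use warnings;

package LCP;

# Longest common prefix, compared character by character with substr.
sub LCP {
    return '' unless @_;
    return $_[0] if @_ == 1;
    my $i          = 0;
    my $first      = shift;
    my $min_length = length($first);

    # The prefix cannot be longer than the shortest string.
    foreach (@_) {
        $min_length = length($_) if length($_) < $min_length;
    }

    # Walk the first string and stop at the first position where any
    # other string differs.
INDEX: foreach my $ch ( split //, $first ) {
        last INDEX unless $i < $min_length;
        foreach my $string (@_) {
            last INDEX if substr( $string, $i, 1 ) ne $ch;
        }
    }
    continue { $i++ }
    return substr $first, 0, $i;
}

# Roy's implementation: shorten the candidate prefix until it matches
# the start of every string.
sub LCP2 {
    return '' unless @_;
    my $prefix = shift;
    for (@_) {
        chop $prefix while ( !/^\Q$prefix\E/ );
    }
    return $prefix;
}

1;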
Test suite:
#!/usr/bin/env perl
use strict;
use warnings;

Test::LCP->runtests;

package Test::LCP;
use base 'Test::Class';
use Test::More;
use Benchmark qw(:all :hireswallclock);

sub test_use : Test(startup => 1) {
    use_ok('LCP');
}

sub test_lcp : Test(6) {
    is( LCP::LCP(),      '',    'Without parameters' );
    is( LCP::LCP('abc'), 'abc', 'One parameter' );
    is( LCP::LCP( 'abc', 'xyz' ), '', 'None of common prefix' );
    is( LCP::LCP( 'abcdefgh', ('abcdefgh') x 15, 'abcdxyz' ),
        'abcd', 'Some common prefix' );
    my @str = map { chomp; $_ } <DATA>;
    is( LCP::LCP(@str),
        'file:///home/gms8994/Music/', 'Test data prefix' );
    is( LCP::LCP2(@str),
        'file:///home/gms8994/Music/', 'Test data prefix by LCP2' );
    my $t = countit( 1, sub { LCP::LCP(@str) } );
    diag("LCP: ${\($t->iters)} iterations took ${\(timestr($t))}");
    $t = countit( 1, sub { LCP::LCP2(@str) } );
    diag("LCP2: ${\($t->iters)} iterations took ${\(timestr($t))}");
}

__DATA__
file:///home/gms8994/Music/t.A.T.u./
file:///home/gms8994/Music/nina%20sky/
file:///home/gms8994/Music/A%20Perfect%20Circle/
Test suite result:
1..7
ok 1 - use LCP;
ok 2 - Without parameters
ok 3 - One parameter
ok 4 - None of common prefix
ok 5 - Some common prefix
ok 6 - Test data prefix
ok 7 - Test data prefix by LCP2
# LCP: 22635 iterations took 1.09948 wallclock secs ( 1.09 usr + 0.00 sys = 1.09 CPU) @ 20766.06/s (n=22635)
# LCP2: 17919 iterations took 1.06787 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) @ 16746.73/s (n=17919)
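That means the pure Perl solution using substr is about 20% faster than Roy's solution (http://stackoverflow.com/questions/499967/programmatically-determining-the-longest-similar-portion-of-several-strings/500100#500100) on your test case, and finding one prefix takes about 50 µs. There is no need to use XS unless your data or performance expectations are bigger.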
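For comparison, here is a sketch of a different approach that is not part of the original answer and has not been benchmarked against it. It relies on the fact that a prefix shared by the lexicographically smallest and largest strings is shared by every string that sorts between them, so only those two need to be compared character by character. The name lcp_minmax is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original answer: longest common
# prefix computed from the lexicographic minimum and maximum only.
sub lcp_minmax {
    return '' unless @_;

    # One pass to find the smallest and largest string (codepoint order).
    my ( $min, $max ) = ( $_[0], $_[0] );
    for my $s (@_) {
        $min = $s if $s lt $min;
        $max = $s if $s gt $max;
    }

    # Compare just those two; the prefix cannot exceed the shorter one.
    my $limit = length($min) < length($max) ? length($min) : length($max);
    my $i = 0;
    $i++ while $i < $limit && substr( $min, $i, 1 ) eq substr( $max, $i, 1 );

    return substr( $min, 0, $i );
}

print lcp_minmax(
    'file:///home/gms8994/Music/t.A.T.u./',
    'file:///home/gms8994/Music/nina%20sky/',
    'file:///home/gms8994/Music/A%20Perfect%20Circle/',
), "\n";
```

Whether this beats the substr loop depends on the data: it always makes a full pass of string comparisons over the list, whereas the loop in LCP can stop as soon as one column differs.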