如何在 Perl 中有效地计算覆盖给定范围的范围? [英] How can I efficiently count ranges that cover a given range in Perl?

查看:44
本文介绍了如何在 Perl 中有效地计算覆盖给定范围的范围?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个包含大约 30k 个范围的数据库,每个范围都以一对起点和终点的形式给出:

I have a database of some 30k ranges, each is given as a pair of start and end points:

[12,80],[34,60],[34,9000],[76,743],...

我想编写一个 Perl 子例程,它是一个范围(不是来自数据库),并返回数据库中完全包含"给定范围的范围数.

I would like to write a Perl subroutine that a range (not from the database), and returns the number of ranges in the database which fully 'include' the given range.

例如,如果我们在数据库中只有这 4 个范围并且查询范围是 [38,70],那么子程序应该返回 2,因为第一个和第三个范围都完全包含查询范围.

For example, if we had only those 4 ranges in the database and the query range is [38,70], the subroutine should return 2, since the first and third ranges both fully contain the query range.

问题:我希望查询尽可能便宜",如果有帮助,我不介意进行大量预处理.

The problem: I wish to make the queries as "cheap" as possible, I don't mind doing lots of pre-processing, if it helps.

一些注意事项:

  1. 我随意使用了数据库"这个词,我不是指实际的数据库(例如 SQL);这只是一长串范围.

  1. I used the word "database" freely, I don't mean an actual database (e.g. SQL); it's just a long list of ranges.

我的世界是圆形的...有一个给定的max_length(例如9999)和范围如[8541,6] 是合法的(您可以将其视为 [8541,9999][1,6] 的并集的单个范围).

My world is circular... There is a given max_length (e.g. 9999) and ranges like [8541,6] are legal (you can think of it as a single range that is the union of [8541,9999] and [1,6]).

谢谢,戴夫

更新这是我的原始代码:

use strict;
use warnings;

my $max_length = 200;
my @ranges     = (
    { START => 10,   END => 100 },
    { START => 30,   END => 90 },
    { START => 50, END => 80 },
    { START => 180,  END => 30 }
);

sub n_covering_ranges($) {
    my ($query_h) = shift;
    my $start     = $query_h->{START};
    my $end       = $query_h->{END};
    my $count     = 0;
    if ( $end >= $start ) {

        # query range is normal
        foreach my $range_h (@ranges) {
            if (( $start >= $range_h->{START} and $end <= $range_h->{END} )
                or (    $range_h->{END} <= $range_h->{START} and  $range_h->{START} <= $end )
                or ( $range_h->{END} <= $range_h->{START} and  $range_h->{END} >= $end)
                )
            {
                $count++;
            }
        }

    }

    else {

        # query range is hanging over edge
        # only other hanging over edges can contain it
        foreach my $range_h (@ranges) {
            if ( $start >= $range_h->{START} and $end <= $range_h->{END} ) {
                $count++;
            }
        }

    }

    return $count;
}

print n_covering_ranges( { START => 1, END => 10 } ), "\n";
print n_covering_ranges( { START => 30, END => 70 } ), "\n";

而且,是的,我知道 if 很丑,可以做得更好更高效.

and, yes, I know the ifs are ugly and can be made much nicer and more efficient.

更新 2 - 基准建议解决方案

到目前为止,我已经为两个目的解决方案做了一些基准测试:天真的,由 cjm 建议,类似于我原来的解决方案,以及 内存要求高的问题,由亚里士多德·帕加尔齐斯提出 再次感谢给你们俩!

I've don some benchmarking for the two purposed solutions so far: the naive one, suggested by cjm, which is similar to my original solutions, and the memory-demanding one, suggested by Aristotle Pagaltzis Thanks again for both of you!

为了比较两者,我创建了以下使用相同接口的包:

To compare the two, I created the following packages which use the same interface:

use strict;
use warnings;

package RangeMap;

sub new {
    my $class      = shift;
    my $max_length = shift;
    my @lookup;
    for (@_) {
        my ( $start, $end ) = @$_;
        my @idx
            = $end >= $start
            ? $start .. $end
            : ( $start .. $max_length, 0 .. $end );
        for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;
    return 0 unless defined $self->[$start];
    return 0 + grep { $end <= $_ } unpack 'L*', $self->[$start];
}

1;

和:

use strict;
use warnings;

package cjm;

sub new {
    my $class      = shift;
    my $max_length = shift;

    my $self = {};
    bless $self, $class;

    $self->{MAX_LENGTH} = $max_length;

    my @normal  = ();
    my @wrapped = ();

    foreach my $r (@_) {
        if ( $r->[0] <= $r->[1] ) {
            push @normal, $r;
        }
        else {
            push @wrapped, $r;
        }
    }

    $self->{NORMAL}  = \@normal;
    $self->{WRAPPED} = \@wrapped;
    return $self;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;

    if ( $start <= $end ) {

        # This is a normal range
        return ( grep { $_->[0] <= $start and $_->[1] >= $end }
                @{ $self->{NORMAL} } )
            + ( grep { $end <= $_->[1] or $_->[0] <= $start }
                @{ $self->{WRAPPED} } );
    }
    else {

        # This is a wrapped range
        return ( grep { $_->[0] <= $start and $_->[1] >= $end }
                @{ $self->{WRAPPED} } )

            # This part should probably be calculated only once:
            + ( grep { $_->[0] == 1 and $_->[1] == $self->{MAX_LENGTH} }
                @{ $self->{NORMAL} } );
    }
}

1;

然后我使用了一些真实数据:$max_length=3150000,大约17000个范围,平均大小为几千,最后查询了大约10000个查询的对象.我对对象的创建(添加所有范围)和查询进行计时.结果:

I then used some real data: $max_length=3150000, about 17000 ranges with an average size of a few thousands, and finally queried the objects with some 10000 queries. I timed the creation of the object (adding all the ranges) and the querying. The results:

cjm creation done in 0.0082 seconds
cjm querying done in 21.209857 seconds
RangeMap creation done in 45.840982 seconds
RangeMap querying done in 0.04941 seconds

祝贺亚里士多德·帕加尔齐斯!您的实施速度非常快!但是,要使用此解决方案,我显然希望对对象进行一次预处理(创建).我可以在创建后存储 (nstore) 这个对象吗?我以前从来没有这样做过.我应该如何检索它?有什么特别的吗?希望检索速度快,以免影响这个伟大数据结构的整体性能.

Congratulations Aristotle Pagaltzis! Your implementation is super-fast! To use this solution, however, I will obviously like to do the pre-processing (creation) of the object once. Can I store (nstore) this object after its creation? I've Never done this before. And how should I retrieve it? Anything special? Hopefully the retrieval will be fast so it won't effect the overall performance of this great data structure.

更新 3

我尝试了一个简单的 nstore 并检索 RangeMap 对象.这似乎工作正常.唯一的问题是生成的文件大约为 1GB,我将有大约 1000 个这样的文件.我可以忍受 1 TB 的存储空间,但我想知道是否有任何方法可以更有效地存储它而不会显着影响检索性能.另请参阅此处:http://www.perlmonks.org/?node_id=861961.

I tried a simple nstore and retrieve for the RangeMap object. This seems to work fine. The only problem is the resulting file is around 1GB, and I will have some 1000 such file. I could live with a TB of storage for this, but I wonder if there's anyway to store it more efficiently without significantly effecting retrieval performance too much. Also see here: http://www.perlmonks.org/?node_id=861961.

更新 4 - RangeMap 错误

UPDATE 4 - RangeMap bug

不幸的是,RangeMap 有一个错误.感谢 PerlMonks 的 BrowserUK 指出这一点.例如,创建一个具有 $max_lenght=10 和单个范围 [6,2] 的对象.然后查询[7,8].答案应该是1,而不是0.

Unfortunately, RangeMap has a bug. Thanks to BrowserUK from PerlMonks for pointing that out. For example, create an object with $max_lenght=10 and as single range [6,2]. Then query for [7,8]. The answer should be 1, not 0.

我认为这个更新的包应该可以完成工作:

I think this updated package should do the work:

use strict;
use warnings;

package FastRanges;

sub new($$$) {
    my $class      = shift;
    my $max_length = shift;
    my $ranges_a   = shift;
    my @lookup;
    for ( @{$ranges_a} ) {
        my ( $start, $end ) = @$_;
        my @idx
            = $end >= $start
            ? $start .. $end
            : ( $start .. $max_length, 1 .. $end );
        for my $i (@idx) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing($$$) {
    my $self = shift;
    my ( $start, $end ) = @_;    # query range coordinates

    return 0
        unless ( defined $self->[$start] )
        ;    # no ranges overlap the start position of the query

    if ( $end >= $start ) {

        # query range is simple
        # any inverted range in {LOOKUP}[$start] must contain it,
        # and so does any simple range which ends at or after $end
        return 0 + grep { $_ < $start or $end <= $_ } unpack 'L*',
            $self->[$start];
    }
    else {

        # query range is inverted
        # only inverted ranges in {LOOKUP}[$start] which also end
        # at of after $end contain it. simple ranges can't contain
        # the query range
        return 0 + grep { $_ < $start and $end <= $_ } unpack 'L*',
            $self->[$start];
    }
}

1;

欢迎您提出意见.

推荐答案

你有很多可用的内存吗?

Do you have a lot of memory available?

my $max_length = 9999;
my @range = ( [12,80],[34,60],[34,9000] );

my @lookup;

for ( @range ) {
    my ( $start, $end ) = @$_;
    my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 0 .. $end );
    for my $i ( @idx ) { $lookup[$i] .= pack "L", $end }
}

现在您在 @lookup 中有一个压缩数字列表数组,其中每个索引处的压缩列表包含包含该点的所有范围的结尾.因此,要检查有多少范围包含另一个范围,您可以在数组中查找其起始索引,然后计算该索引处打包列表中小于或等于结束索引的条目数.对于覆盖任何一点的最大范围数(限制为范围的总数),该算法是O(n),每次迭代的开销非常小.

Now you have an array of packed number lists in @lookup where the packed list at each index contains the ends of all ranges which include that point. So to check how many ranges contain another range, you look up its starting index in the array, and then count the number of entries from the packed list at that index, which are smaller or equal that the ending index. This algorithm is O(n) with respect to the maximum number of ranges covering any one point (with the limit being the total number of ranges), with a very small overhead per iteration.

sub num_ranges_containing {
    my ( $start, $end ) = @_;

    return 0 unless defined $lookup[$start];

    # simple ranges can be contained in inverted ranges,
    # but inverted ranges can only be contained in inverted ranges
    my $counter = ( $start <= $end )
        ? sub { 0 + grep { $_ < $start or  $end <= $_ } }
        : sub { 0 + grep { $_ < $start and $end <= $_ } };

    return $counter->( unpack 'L*', $lookup[$start] );
}

未经测试.

为了更加整洁,

package RangeMap;

sub new {
    my $class = shift;
    my $max_length = shift;
    my @lookup;
    for ( @_ ) {
        my ( $start, $end ) = @$_;
        my @idx = $end >= $start ? $start .. $end : ( $start .. $max_length, 0 .. $end );
        for my $i ( @idx ) { $lookup[$i] .= pack 'L', $end }
    }
    bless \@lookup, $class;
}

sub num_ranges_containing {
    my $self = shift;
    my ( $start, $end ) = @_;

    return 0 unless defined $self->[$start];

    # simple ranges can be contained in inverted ranges,
    # but inverted ranges can only be contained in inverted ranges
    my $counter = ( $start <= $end )
        ? sub { 0 + grep { $_ < $start or  $end <= $_ } }
        : sub { 0 + grep { $_ < $start and $end <= $_ } };

    return $counter->( unpack 'L*', $self->[$start] );
}

package main;
my $rm = RangeMap->new( 9999, [12,80],[34,60],[34,9000] );

这样你就可以有任意数量的范围.

That way you can have any number of ranges.

同样未经测试.

这篇关于如何在 Perl 中有效地计算覆盖给定范围的范围?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆