在Perl中对数组进行二进制搜索 [英] binary search in an array in Perl

查看:90
本文介绍了在Perl中对数组进行二进制搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个十六进制数字数组,我需要遍历其他数字并检查它们是否出现在该数组中.现在我正在使用foreach循环,每次循环遍历整个数组.有没有一种方法可以通过首先对数组进行排序,然后对它执行二进制搜索来使其更快.

I have an array of hex numbers, and I need to go over other numbers and check if they appear in that array. Right now i'm using a foreach loop that goes over the entire array each time. Is there a way to make it faster by sorting the array at first, and then implementing binary search on it.

当前代码:

sub is_bad_str{
  my ($str, @keys) = @_;
  my $flag = 0;
  my ($key, $hex_num);
        if ($str =~ m/14'h([0-9a-f][0-9a-f][0-9a-f][0-9a-f])/;){ #'# fixes bad highlighting
  $hex_num = $1;
      }
  if (defined $hex_num){
    foreach $key (@keys){
        if ($hex_num =~ /\Q$key\E/i){
            $flag = 1;
            last;
        }
    }
  }
  if (($flag == 0) && (defined $hex_num)){
    return 1;#Bad str
  }else{
    return 0;#Good str
      }
}

推荐答案

有四种策略可以在Perl中对一组数据进行有效的批量搜索.

There are four strategies to doing efficient bulk search in a set of data in Perl.

下面概述了完整的分析,但总而言之,哈希查找和更差的BST当然可以提供具有大量搜索次数的平均随机数据集的最佳性能.

The full analysis is outlined below, but in summary, the best performance on average random data set with a significant # of searches is of course offered by hash lookups, followed by much-worse BST.

  1. 二进制(半间隔)搜索数组.

这显然是一种标准的算法方法.

This is obviously a standard algorithmic approach.

性能成本:

Performance Costs:

  • O(N * log N)用于初始排序.
  • O(N)平均用于在排序后在列表中插入/删除数据. Perl数组不是链接列表,因此它不是O(log N).
  • 每次搜索
  • O(log N).

  • O(N * log N) for initial sorting.
  • O(N) on average for inserting/removing data in the list once sorted. Perl arrays are NOT linked lists so it's not an O(log N).
  • O(log N) for each search.

实现 :算法非常简单,DIY很容易.像往常一样,存在CPAN模块,并且无论如何应该应该使用CPAN模块代替DIY: Search::Binary .

Implementation: the algorithm is so simple that DIY is easy. As usual, CPAN modules exist and should probably be used instead of DIY anyway: Search::Binary .

二进制搜索树(BST)

Binary Search Trees (BSTs)

性能成本:

Performance Costs:

  • O(N * log N)用于初始排序.
  • O(log N)平均用于在排序后在列表中插入/删除数据
  • O(log N)进行每次搜索.
  • O(N * log N) for initial sorting.
  • O(log N) on average for inserting/removing data in the list once sorted
  • O(log N) for each search.


实现 :CPAN上存在多种形式:在算法上具有更好的平均性能和较小的性能波动.

比较 :如果数据将更改,则必须使用BST以避免重新分类费用.如果您的数据是随机的并且一旦排序就永远不会改变,则可以在BST上使用简单的二进制搜索,但是如果性能的每一盎司都可以更好地调整BST(如果您知道查找的结果,则BST可以比列表二进制搜索进行优化以实现更快的平均搜索速度)基于数据分布的费用-请参见 Wiki的最佳二进制搜索树"部分,或者数据分发有利于特殊树之一,例如Treap或Red/Black).

Comparison: If the data WILL change, you must use BST to avoid re-sort costs. If your data is random and never changes once sorted, you can use simple binary search over BST but BSTs can be tuned better if every last ounce of performance matter (BST can be optimized for faster average searching than list binary search if you know your lookup costs based on data distribution - see Wiki's "Optimal binary search trees" section or if your data distribution favors one of the special trees like Treap or Red/Black).

缩写(短路)扫描查找.

这些是对未排序列表的线性扫描搜索,一旦找到该项目就会停止搜索.

These are linear scan searches on un-sorted list which STOP searching once the item is found.

性能 :每次搜索O(N)随机数据,但O(N)(例如N/2)比完整列表搜索(如 grep .没有额外的费用.

Performance: O(N) per search for random data, but a faster O(N) (say, N/2) than a full-list search like grep. No extra costs.

实现 :在Perl中有3种方法来实现它们:

Implementation: There are 3 ways to do them in Perl:

  • 智能匹配运算符(~~).问题在于它仅在Perl 5.10及更高版本中可用.
  • 您自己的循环,一旦找到就会执行next;.
  • List::MoreUtils 模块的子例程.
  • Smart match operator (~~). The problem is that it is ONLY available in Perl 5.10 and above.
  • Your own loop that does next; once found.
  • List::MoreUtils module's first() subroutine.

比较:

Comparison:

  • 首先,在上述3个实现之间,List::MoreUtils::first比DIY循环更快,因为它是在XS中实现的;因此应在5.10之前的Perl版本中使用.智能匹配可能同样快,尽管我会在选择Perl 5.10+中的一个之前先对两者进行基准测试.

  • First, between the 3 implementations above, List::MoreUtils::first is faster than the DIY loop because it's implemented in XS; so it should be used in Perl versions before 5.10. Smart match is probably just as fast, though I would benchmark the two before you pick one or the other in Perl 5.10+.

第二,将短路搜索与其他方法进行比较,只有3种极端情况应使用它:

Second, comparing short circuited search to other methods, there are only 3 edge cases where it should be used:

A. 内存限制.排序列表搜索,BST和哈希查找的内存占用量至少为2*N.如果您面临严重的内存限制(给定列表大小),以致N2*N内存成为不可协商的成本障碍,那么您可以使用短路搜索并及时支付性能损失. 当您批量/逐行处理大型数据集时,尤其如此,这样可以避免首先将整个数据存储在内存中.

A. Memory constraints. Both sorted list search, BST and hash lookups have memory footprint of at the very least 2*N. If you face a memory constraint (given your list size) severe enough that N vs 2*N memory becomes a non-negotiable cost barrier, then you use short circuited search and pay the performance penalty in time. This is especially true when you're processing a large data set in batches/line-by-line, so as to avoid storing the whole thing in memory in the first place.

B.如果您的数据以这样的方式进行分配和预排序,以使VAST大多数搜索都将在列表的最开始找到它们的采石场.如果是这样,尽管它的O(log N)平均搜索速度明显更快,但它仍可能会胜过像二进制搜索的BST这样的更出色的方法.仍然很难胜过哈希查找,但稍后会详细介绍.

B. If your data is distributed and pre-sorted in such a way that a VAST majority of the searches would find their quarry in the very beginning of the list. If that's the case, it MAY outperform the fancier methods like BST of binary search despite their obviously faster O(log N) average searches. It'd still be hard to outperform hash lookups, but more on that later.

C.如果执行的搜索数量与列表大小相比很小,则短路搜索优于BST或排序列表搜索,在这种情况下,前两种方法的初始排序成本(O(N log N))会超过搜索节省量.由于BST与线性搜索的节省量为O(M * N),其中M为搜索次数,因此,为了实现平均节省,搜索次数M必须小于O(log N),但第二次可能要更多边缘情况,由于数据分布,平均扫描成本低于O(N).

C. Short circuited search is superior to BSTs or sorted list searches if the # of searches performed is fairly small compared to the list size, in which case the initial sorting cost of the first 2 methods (O(N log N)) would outweigh the search savings. Since the savings of BST vs. linear search are O(M * N) where M is # of searches, it follows that # of searches M must be less than O(log N) to realize the savings on average, but may be more in the second edge case where average scan cost is less than O(N) due to data distribution.

哈希查找

性能成本:

Performance Costs:

  • O(N) + epsilon用于初始哈希创建(由于可能发生键冲突,因此严格来讲,对于随机大数据集,严格来说,这不是O(N).我对Perl的哈希实现不了解得足够多,只能说明它任何哈希表都可能引起关注.
  • O(1)平均用于在排序后在列表中插入/删除数据(由于键冲突,+与初始哈希创建相同的epsilon).
  • 每次搜索
  • O(1)(加上相同的epsilon).

  • O(N) + epsilon for initial hash creation (It's not strictly speaking O(N) for a random large data set, due to possible key collision. I don't know enough about Perl's hash implementation to clarify this other than state that it CAN be come a concern for any hashmaps.
  • O(1) on average for inserting/removing data in the list once sorted (+ same epsilon as initial hash creation due to key collisions).
  • O(1) for each search (plus same epsilon).

实施:

Implementation:

my %lookup = map { $_ => 1 } @list; 
my @lookup2{ @list } = (); # Alternative method of creating a hash
sub find { return exists $lookup{$_[0]; }

比较:

Comparison:

  • 首先,与BST和短路搜索相比,将相同的逻辑应用于比较哈希查找的短路搜索和短路搜索.例如,除了相同的两种边缘情况(数据集使得平均列表扫描变为O(1)而不是O(N)且搜索次数与数据集之比)外,ALMOST始终在线性搜索上始终使用哈希图大小使得搜索的总费用少于创建哈希所需的O(N).

  • First, the same logic applies to comparing short circuited search with hash lookups as with BST vs short-circuited search. E.g., you should ALMOST always use hashmaps over linear search, with the exception of the same two edge cases (data set is such that average list scan becomes O(1) instead of O(N) and the ratio of # of searches to the data set size makes the aggregate cost of searches less than O(N) needed for hash creation).

第二,平均散列图显然比BST或二进制列表搜索快得多 .唯一可能出现的情况是,您不知何故陷入了一个数据集,该数据集设法使存储桶超载,并将多余的ε"成本转化为足够大的权重,从而开始表现不佳..

Second, hashmaps ON AVERAGE are obviously much faster than BSTs or binary list search. The only possible edge case here is that you somehow stumble into a data set that manages to overload the buckets and turn that extra "epsilon" cost into a large enough weight that it starts under-performing O(log N).

我强烈怀疑它是否甚至有可能实现,但是再次重申,对Perl的hashmap实现了解不足,无法证明即使是最差的数据集也永远不会发生.

I strongly doubt that it is even remotely possible, but again, don't know enough about Perl's implementation of hashmaps to prove that it would never happen for even the worst data set.

这篇关于在Perl中对数组进行二进制搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆