What is the fastest substring search algorithm?


Problem description

OK, so I don't sound like an idiot I'm going to state the problem/requirements more explicitly:

  • Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
  • Function should return a pointer to the first match, or NULL if no match is found.
  • Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
  • Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
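For concreteness, the naive brute-force baseline that the requirements above use as a yardstick might be sketched as follows (the name `naive_strstr` is illustrative, not code from the question):

```c
#include <stddef.h>

/* Naive O(n*m) baseline matching the stated interface: both strings are
 * null-terminated, no length information, returns a pointer to the first
 * match or NULL. Illustrative sketch only. */
static char *naive_strstr(const char *haystack, const char *needle)
{
    if (!*needle)
        return (char *)haystack;     /* empty needle matches at the start */
    for (; *haystack; haystack++) {
        const char *h = haystack, *n = needle;
        while (*h && *n && *h == *n)
            h++, n++;
        if (!*n)                     /* consumed the whole needle: match */
            return (char *)haystack;
    }
    return NULL;
}
```

Any candidate algorithm is measured against this: per the requirements it must never be measurably slower, especially on short needles.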

...as well as what I mean by "fastest":

  • Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
  • Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
  • Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
  • Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".

My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way.

Update: My current best algorithm is as follows:

  • For needles of length 1, use strchr.
  • For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: Preload needle in a 16- or 32-bit integer with bitshifts and cycle old byte out/new bytes in from the haystack at each iteration. Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
  • For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
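The length-2 case of the word-at-a-time scan described above might be sketched like this (my reconstruction from the description, with `twobyte_strstr` as an illustrative name; the last two haystack bytes live in a 16-bit word that is compared against the preloaded needle word):

```c
#include <stddef.h>
#include <stdint.h>

/* Length-2 needle: shift each new haystack byte into a 16-bit window and
 * compare against the needle loaded as a 16-bit word. Every haystack byte
 * is read exactly once and checked against 0 (end of string). */
static char *twobyte_strstr(const char *hs, const char *ns)
{
    const unsigned char *h = (const unsigned char *)hs;
    const unsigned char *n = (const unsigned char *)ns;
    uint16_t nw, hw;

    if (!h[0])
        return NULL;                              /* haystack shorter than needle */
    nw = (uint16_t)(n[0] << 8 | n[1]);
    hw = (uint16_t)(h[0] << 8 | h[1]);
    for (h++; *h && hw != nw; hw = (uint16_t)(hw << 8 | *++h))
        ;                                         /* cycle old byte out, new byte in */
    return *h ? (char *)(h - 1) : NULL;
}
```

The 3- and 4-byte cases follow the same pattern with a 32-bit window (left-justified for length 3 so the unused low byte never affects the comparison).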

The big questions left in my mind are:

  • Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
  • The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the O(m) (where m is needle length) in space algorithms could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.

Bonus points for:

  • Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formedness imposes some string alignment requirements between the needle and haystack, and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)

Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Answer

Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.
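A minimal harness for that kind of profiling might look like the following (an illustrative sketch; here the candidate being timed is just libc's `strstr`, and the iteration count is arbitrary):

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Time `iters` runs of a candidate search function on one needle/haystack
 * pair. Swap each algorithm under test in for `strstr` and compare the
 * totals on a corpus of likely inputs. */
static double bench(char *(*search)(const char *, const char *),
                    const char *hay, const char *nee, int iters)
{
    clock_t t0 = clock();
    for (int i = 0; i < iters; i++) {
        volatile char *r = search(hay, nee);   /* volatile keeps the call alive */
        (void)r;
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

In practice the corpus matters far more than the harness: run each algorithm over the same set of realistic needle/haystack pairs and compare totals, not single calls.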

Boyer-Moore uses a bad character table with a good suffix table.

Boyer-Moore-Horspool uses a bad character table.
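As a sketch of how that table is used (textbook Horspool, not code from this answer): the byte aligned with the end of the current window determines how far the window can shift.

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool sketch: shift[c] is how far to slide the window
 * when haystack byte c sits under the last needle position. Bytes absent
 * from the needle allow a full needle-length shift. */
static char *horspool(const char *hay, const char *nee)
{
    size_t n = strlen(hay), m = strlen(nee);
    size_t shift[256];

    if (m == 0)
        return (char *)hay;
    if (m > n)
        return NULL;
    for (size_t i = 0; i < 256; i++)
        shift[i] = m;                             /* default: skip whole needle */
    for (size_t i = 0; i + 1 < m; i++)            /* all but the last needle byte */
        shift[(unsigned char)nee[i]] = m - 1 - i;
    for (size_t pos = 0; pos + m <= n;
         pos += shift[(unsigned char)hay[pos + m - 1]])
        if (memcmp(hay + pos, nee, m) == 0)
            return (char *)(hay + pos);
    return NULL;
}
```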

Knuth-Morris-Pratt uses a partial match table.
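A sketch of the standard KMP scheme (again textbook code, not from this answer); the partial-match (failure) table lets the scan advance without ever re-reading a haystack byte. Note that the `malloc` here is exactly the kind of allocation the question forbids without a fallback:

```c
#include <stdlib.h>
#include <string.h>

/* Knuth-Morris-Pratt sketch. fail[i] = length of the longest proper
 * prefix of nee[0..i] that is also a suffix of it. */
static char *kmp_search(const char *hay, const char *nee)
{
    size_t m = strlen(nee), k = 0;
    size_t *fail;
    char *ret = NULL;

    if (m == 0)
        return (char *)hay;
    fail = malloc(m * sizeof *fail);
    if (!fail)
        return NULL;          /* a production version needs a constant-space fallback */
    fail[0] = 0;
    for (size_t i = 1; i < m; i++) {              /* build the partial-match table */
        while (k && nee[i] != nee[k])
            k = fail[k - 1];
        if (nee[i] == nee[k])
            k++;
        fail[i] = k;
    }
    k = 0;
    for (const char *h = hay; *h; h++) {          /* linear left-to-right scan */
        while (k && *h != nee[k])
            k = fail[k - 1];
        if (*h == nee[k])
            k++;
        if (k == m) {                             /* match ends at h */
            ret = (char *)(h - m + 1);
            break;
        }
    }
    free(fail);
    return ret;
}
```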

Rabin-Karp uses running hashes.
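A sketch of the rolling-hash idea (textbook Rabin-Karp; the base constant is an arbitrary choice, and the hash wraps via unsigned overflow rather than an explicit modulus). Only a hash hit triggers a byte-by-byte verification, which is why the worst case is quadratic:

```c
#include <stddef.h>
#include <string.h>

/* Rabin-Karp sketch: polynomial rolling hash over an m-byte window,
 * verified with memcmp on hash equality. */
static char *rabin_karp(const char *hay, const char *nee)
{
    const unsigned long B = 257;      /* base; arithmetic wraps mod 2^N */
    size_t n = strlen(hay), m = strlen(nee);
    unsigned long hn = 0, hh = 0, pow = 1;

    if (m == 0)
        return (char *)hay;
    if (m > n)
        return NULL;
    for (size_t i = 0; i < m; i++) {
        hn = hn * B + (unsigned char)nee[i];
        hh = hh * B + (unsigned char)hay[i];
        if (i)
            pow *= B;                 /* pow == B^(m-1) after the loop */
    }
    for (size_t pos = 0; ; pos++) {
        if (hh == hn && memcmp(hay + pos, nee, m) == 0)
            return (char *)(hay + pos);
        if (pos + m >= n)
            return NULL;
        /* roll: drop hay[pos], append hay[pos+m] */
        hh = (hh - (unsigned char)hay[pos] * pow) * B
             + (unsigned char)hay[pos + m];
    }
}
```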

They all trade overhead for reduced comparisons to a different degree, so the real world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.

Edit:

A different algorithm might be best for finding base pairs, English phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.

Think about the following little table. Each question mark might have a different best search algorithm.

                 short needle     long needle
short haystack         ?               ?
long haystack          ?               ?

This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.

If I needed a sample set, I think I would scrape a site like Google or Wikipedia, then strip the HTML from all the result pages. For a search site, type in a word, then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real-world samples if possible.

I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.
