What is the fastest substring search algorithm?


Problem description


OK, so I don't sound like an idiot, I'm going to state the problem/requirements more explicitly:

  • Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
  • Function should return a pointer to the first match, or NULL if no match is found.
  • Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
  • Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.

...as well as what I mean by "fastest":

  • Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
  • Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
  • Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
  • Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".

My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way.

Update: My current optimal algorithm is as follows:

  • For needles of length 1, use strchr.
  • For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: Preload needle in a 16- or 32-bit integer with bitshifts and cycle old byte out/new bytes in from the haystack at each iteration. Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
  • For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
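
The length-2 case described above can be sketched roughly as follows. This is a minimal illustration of the rolling word comparison, not the author's actual code; the name `strstr2` and the guard logic are my own, and the idea is essentially the same as the two-byte helper found in some libc implementations:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a length-2 needle search: pack both needle bytes into a
 * 16-bit word, then slide a 16-bit window over the haystack one byte
 * at a time. Each haystack byte is read exactly once and checked
 * against 0 (end of string) plus one 16-bit comparison. */
static const char *strstr2(const char *hs, const char *ns)
{
    const unsigned char *h = (const unsigned char *)hs;
    const unsigned char *n = (const unsigned char *)ns;
    if (!h[0] || !h[1])
        return NULL;                 /* haystack shorter than the needle */
    uint16_t nw = (uint16_t)(n[0] << 8 | n[1]);
    uint16_t hw = (uint16_t)(h[0] << 8 | h[1]);
    /* Shift the old byte out and the new byte in until the window
     * matches or we hit the terminating null. */
    for (h++; *h && hw != nw; hw = (uint16_t)(hw << 8 | *++h))
        ;
    return hw == nw ? (const char *)(h - 1) : NULL;
}
```

The 3- and 4-byte cases follow the same pattern with a 32-bit window (the length-3 variant leaves one byte of the word unused).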

The big questions left in my mind are:

  • Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.
  • The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the O(m) (where m is needle length) in space algorithms could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.

Bonus points for:

  • Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)

Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Solution

Build up a test library of likely needles and haystacks. Profile the tests on several search algorithms, including brute force. Pick the one that performs best with your data.

Boyer-Moore uses a bad character table with a good suffix table.

Boyer-Moore-Horspool uses a bad character table.
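
As a rough sketch of Horspool's bad-character shift (illustrative names, not any particular library's implementation): the table gives, for the haystack byte aligned with the end of the current window, how far the window can safely jump.

```c
#include <stddef.h>
#include <string.h>

/* Minimal Boyer-Moore-Horspool sketch. shift[c] is the distance to
 * slide the window when haystack byte c is aligned with the window's
 * last position; bytes absent from the needle allow a full-length jump. */
static const char *horspool(const char *hay, const char *needle)
{
    size_t n = strlen(hay), m = strlen(needle);
    if (m == 0) return hay;
    if (m > n)  return NULL;

    size_t shift[256];
    for (size_t i = 0; i < 256; i++)
        shift[i] = m;                       /* default: jump past the window */
    for (size_t i = 0; i + 1 < m; i++)      /* all but the last needle byte */
        shift[(unsigned char)needle[i]] = m - 1 - i;

    for (size_t pos = 0; pos + m <= n; ) {
        if (memcmp(hay + pos, needle, m) == 0)
            return hay + pos;
        pos += shift[(unsigned char)hay[pos + m - 1]];
    }
    return NULL;
}
```

Note the 256-entry table is exactly the kind of per-call initialization cost the question's 32-byte bitmap trick is designed to avoid.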

Knuth-Morris-Pratt uses a partial match table.
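
The partial-match (failure) table can be built like this (a textbook sketch; `kmp_table` is a made-up name): entry `pi[i]` is the length of the longest proper prefix of `needle[0..i]` that is also a suffix of it, which tells KMP how far to fall back after a mismatch without re-reading haystack bytes.

```c
#include <stddef.h>

/* Build the KMP failure table for needle[0..m-1] into pi[0..m-1]. */
static void kmp_table(const char *needle, size_t m, size_t *pi)
{
    pi[0] = 0;
    size_t k = 0;                      /* length of current prefix-suffix */
    for (size_t i = 1; i < m; i++) {
        while (k > 0 && needle[i] != needle[k])
            k = pi[k - 1];             /* fall back to a shorter border */
        if (needle[i] == needle[k])
            k++;
        pi[i] = k;
    }
}
```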

Rabin-Karp uses running hashes.
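
A minimal Rabin-Karp sketch, using a simple polynomial rolling hash over 2^64 and falling back to `memcmp` to confirm candidate matches (names and the base constant are illustrative):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rabin-Karp sketch: the window hash is updated in O(1) per position
 * by removing the outgoing byte and appending the incoming one. Hash
 * collisions are resolved with memcmp, so matches are always exact. */
static const char *rabin_karp(const char *hay, const char *needle)
{
    size_t n = strlen(hay), m = strlen(needle);
    if (m == 0) return hay;
    if (m > n)  return NULL;

    const uint64_t B = 257;            /* hash base; arithmetic mod 2^64 */
    uint64_t hn = 0, hh = 0, pow = 1;
    for (size_t i = 0; i < m; i++) {
        hn = hn * B + (unsigned char)needle[i];
        hh = hh * B + (unsigned char)hay[i];
        if (i) pow *= B;               /* pow = B^(m-1) after the loop */
    }
    for (size_t pos = 0; ; pos++) {
        if (hh == hn && memcmp(hay + pos, needle, m) == 0)
            return hay + pos;
        if (pos + m >= n)
            return NULL;
        hh = (hh - (unsigned char)hay[pos] * pow) * B
           + (unsigned char)hay[pos + m];
    }
}
```

As the question notes, the plain version is O(nm) in the worst case (every window can collide), which is why it only qualifies if combined with a more robust fallback.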

They all trade overhead for reduced comparisons to a different degree, so the real world performance will depend on the average lengths of both the needle and haystack. The more initial overhead, the better with longer inputs. With very short needles, brute force may win.

Edit:

A different algorithm might be best for finding base pairs, English phrases, or single words. If there were one best algorithm for all inputs, it would have been publicized.

Think about the following little table. Each question mark might have a different best search algorithm.

                 short needle     long needle
short haystack         ?               ?
long haystack          ?               ?

This should really be a graph, with a range of shorter to longer inputs on each axis. If you plotted each algorithm on such a graph, each would have a different signature. Some algorithms suffer with a lot of repetition in the pattern, which might affect uses like searching for genes. Some other factors that affect overall performance are searching for the same pattern more than once and searching for different patterns at the same time.

If I needed a sample set, I think I would scrape a site like google or wikipedia, then strip the html from all the result pages. For a search site, type in a word then use one of the suggested search phrases. Choose a few different languages, if applicable. Using web pages, all the texts would be short to medium, so merge enough pages to get longer texts. You can also find public domain books, legal records, and other large bodies of text. Or just generate random content by picking words from a dictionary. But the point of profiling is to test against the type of content you will be searching, so use real world samples if possible.

I left short and long vague. For the needle, I think of short as under 8 characters, medium as under 64 characters, and long as under 1k. For the haystack, I think of short as under 2^10, medium as under 2^20, and long as up to 2^30 characters.
