Data Structure for Subsequence Queries

Problem Description

In a program I need to efficiently answer queries of the following form:


Given a set of strings A and a query string q, return all s ∈ A such that q is a subsequence of s

For example, given A = {"abcdef", "aaaaaa", "ddca"} and q = "acd", exactly "abcdef" should be returned.

The following is what I have considered so far:


  1. For each possible character, make a sorted list of all string/location pairs where it appears. For querying, interleave the lists of the involved characters and scan through them looking for matches within string boundaries.

This would probably be more efficient for words instead of characters, since the limited number of different characters will make the return lists very dense.

  2. For each n-prefix q might have, store the list of all matching strings. n might realistically be close to 3. For query strings longer than that we brute force the initial list.

This might speed things up a bit, but one could easily imagine some n-subsequences being present close to all strings in A, which means worst case is the same as just brute forcing the entire set.
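
A minimal sketch of idea 2, assuming a small n (here n = 3), that every length-n subsequence of each s is enumerated at insertion time (up to C(64, 3) ≈ 42,000 keys per string, so memory is the main cost), and that candidates are verified by brute force. All names are illustrative, not from the original question:

from itertools import combinations

def all_n_subsequences(s, n):
  # Every distinct length-n subsequence of s; feasible only for small n and short s.
  return {''.join(t) for t in combinations(s, n)}

def build_prefix_index(A, n=3):
  # Map each length-n subsequence to the list of strings containing it.
  index = {}
  for s in A:
    for key in all_n_subsequences(s, n):
      index.setdefault(key, []).append(s)
  return index

def is_subsequence(q, s):
  it = iter(s)
  return all(c in it for c in q)  # 'c in it' consumes the iterator up to c

def query(index, A, q, n=3):
  # Look up the list for q's n-prefix, then verify each candidate by brute force.
  candidates = index.get(q[:n], []) if len(q) >= n else A
  return [s for s in candidates if is_subsequence(q, s)]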


Do you know of any data structures, algorithms or preprocessing tricks which might be helpful for performing the above task efficiently for large A's? (My s's will be around 100 characters.)

Update: Some people have suggested using LCS to check if q is a subsequence of s. I just want to remind that this can be done using a simple function such as:

def isSub(q, s):
  # Two-pointer scan: walk through s, consuming the characters of q in order.
  i, j = 0, 0
  while i != len(q) and j != len(s):
    if q[i] == s[j]:
      i += 1
      j += 1
    else:
      j += 1
  return i == len(q)  # True iff every character of q was matched
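
For reference, the 'naive' method discussed below is then just a linear scan with this check, roughly O(|A| * |s|) per query:

A = ["abcdef", "aaaaaa", "ddca"]
q = "acd"
print([s for s in A if isSub(q, s)])  # prints ['abcdef']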

Update 2: I've been asked to give more details on the nature of q, A and its elements. While I'd prefer something that works as generally as possible, I assume A will have length around 10^6 and will need to support insertion. The elements s will be shorter with an average length of 64. The queries q will only be 1 to 20 characters and be used for a live search, so the query "ab" will be sent just before the query "abc". Again, I'd much prefer the solution to use the above as little as possible.
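
Since the queries arrive incrementally ("ab" just before "abc"), one trick that works regardless of the index chosen is to narrow down the previous query's result set instead of rescanning all of A whenever the new query extends the old one. A small sketch reusing isSub from above; the class and its names are mine, not from the question:

class LiveSearch:
  def __init__(self, A):
    self.A = list(A)
    self.last_q = ""
    self.last_result = list(A)

  def query(self, q):
    # If q extends the previous query, only its previous matches can still match.
    candidates = self.last_result if q.startswith(self.last_q) else self.A
    result = [s for s in candidates if isSub(q, s)]
    self.last_q, self.last_result = q, result
    return result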

Update 3: It has occurred to me, that a data-structure with O(n^{1-epsilon}) lookups, would allow you to solve OVP / disprove the SETH conjecture. That is probably the reason for our suffering. The only options are then to disprove the conjecture, use approximation, or take advantage of the dataset. I imagine quadlets and tries would do the last in different settings.

Recommended Answer

Tests

There have been four main proposals in this thread:


  1. Shivam Kalra suggested creating an automaton based on all the strings in A. This approach has been tried slightly in the literature, normally under the name "Directed Acyclic Subsequence Graph" (DASG).

  2. J Random Hacker suggested extending my 'prefix list' idea to all 'n choose 3' triplets in the query string, and merging them all using a heap.

  3. In the note "Efficient Subsequence Search in Databases", Rohit Jain, Mukesh K. Mohania and Sunil Prabhakar suggest using a Trie structure with some optimizations and recursively searching the tree for the query. They also have a suggestion similar to the triplet idea.

  4. Finally there is the 'naive' approach, which wanghq suggested optimizing by storing an index for each element of A.
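
One possible reading of the per-element index in item 4 (my interpretation, not necessarily wanghq's) is a next-occurrence table per string, which brings each check down to O(|q|) lookups at the cost of roughly |s| * |alphabet| stored entries per string:

def build_next_table(s, alphabet="abcdefghijklmnopqrstuvwxyz"):
  # nxt[i][c] = smallest j >= i with s[j] == c, or len(s) if c never occurs again.
  n = len(s)
  nxt = [dict() for _ in range(n + 1)]
  for c in alphabet:
    nxt[n][c] = n
  for i in range(n - 1, -1, -1):
    nxt[i].update(nxt[i + 1])
    nxt[i][s[i]] = i
  return nxt

def is_sub_indexed(q, s, nxt):
  pos = 0
  for c in q:
    pos = nxt[pos].get(c, len(s))  # jump straight to the next occurrence of c
    if pos == len(s):
      return False
    pos += 1  # continue matching after the consumed character
  return True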

To get a better idea of what's worth putting continued effort into, I have implemented the above four approaches in Python and benchmarked them on two sets of data. The implementations could all be made a couple of magnitudes faster with a well done implementation in C or Java; and I haven't included the optimizations suggested for the 'trie' and 'naive' versions.

Test 1: A consists of random paths from my filesystem. q are 100 random [a-z] strings of average length 7. As the alphabet is large (and Python is slow) I was only able to use duplets for method 3.

[Plot: construction times in seconds as a function of the size of A]

[Plot: query times in seconds as a function of the size of A]

Test 2: A consists of randomly sampled [a-b] strings of length 20. q are 100 random [a-b] strings of average length 7. As the alphabet is small we can use quadlets for method 3.

[Plot: construction times in seconds as a function of the size of A]

[Plot: query times in seconds as a function of the size of A]

The double logarithmic plot is a bit hard to read, but from the data we can draw the following conclusions:


  • Automatons are very fast at querying (constant time); however, they are impossible to create and store for |A| >= 256. It might be possible that a closer analysis could yield a better time/memory balance, or some tricks applicable to the remaining methods.

  • The dup-/trip-/quadlet method is about twice as fast as my trie implementation and four times as fast as the 'naive' implementation. I used only a linear number of lists for the merge, instead of the n^3 suggested by j_random_hacker (a heap-based merge of such lists is sketched after this list). It might be possible to tune the method better, but in general it was disappointing.

  • My trie implementation consistently does better than the naive approach, by around a factor of two. By incorporating more preprocessing (like "where are the next 'c's in this subtree?") or perhaps merging it with the triplet method, this seems like today's winner.

  • If you can make do with a magnitude less performance, the 'naive' method does comparatively just fine for very little cost.
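
For completeness, the duplet/triplet/quadlet lists above are essentially sorted posting lists of string IDs, and the candidate set for a query is their intersection, which a single heap-based merge produces in time linear in the total list length. A sketch assuming sorted, duplicate-free ID lists; the function name is mine:

import heapq

def intersect_sorted_id_lists(lists):
  # Return the IDs that occur in every one of the sorted, duplicate-free lists.
  k = len(lists)
  result, current, count = [], object(), 0
  for x in heapq.merge(*lists):
    if x == current:
      count += 1
    else:
      current, count = x, 1
    if count == k:
      result.append(x)
  return result

# Example: posting lists for three character pairs of a query; survivors still
# need to be verified against the actual strings with isSub.
print(intersect_sorted_id_lists([[0, 2, 5], [0, 5, 9], [0, 3, 5]]))  # [0, 5]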
