Need algorithm for fast storage and retrieval (search) of sets and subsets

Problem description


I need a way of storing sets of arbitrary size for fast query later on. I'll be needing to query the resulting data structure for subsets or sets that are already stored.

=== Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting people to develop the algorithm themselves. I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want, since from what I understand it 'clusters' the tuples into simpler, discrete/approximate forms and loses the original tuples.

Now, an even simpler example:

[alpha, beta, gamma, delta] [alpha, epsilon, delta] [gamma, niu, omega] [omega, beta]

Query:

[alpha, delta]

Result:

[alpha, beta, gamma, delta] [alpha, epsilon, delta]

So the set elements are just that, unique, unrelated elements. Forget about types and values. The elements can be tested among them for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) more than just creating one now, on the spot.
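To pin down what the query is supposed to return, here is a minimal brute-force baseline in Python. It is only a sketch of the expected semantics (return every stored set that contains all query elements), not one of the established algorithms being asked for; the function name `find_supersets` is made up for illustration.

```python
from typing import FrozenSet, Iterable, List


def find_supersets(stored: Iterable[FrozenSet[str]],
                   query: Iterable[str]) -> List[FrozenSet[str]]:
    """Return every stored set that contains all elements of the query."""
    wanted = frozenset(query)
    # `wanted <= s` is Python's subset test: True when s contains all of wanted.
    return [s for s in stored if wanted <= s]


stored_sets = [
    frozenset({"alpha", "beta", "gamma", "delta"}),
    frozenset({"alpha", "epsilon", "delta"}),
    frozenset({"gamma", "niu", "omega"}),
    frozenset({"omega", "beta"}),
]

matches = find_supersets(stored_sets, ["alpha", "delta"])
for match in matches:
    print(sorted(match))
```

This scans every stored set on each query, which is exactly the cost a dedicated index is meant to avoid, but it is handy as a reference when checking a faster structure.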

== Original examples:

For example, say the database contains these sets

[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1] 

If I use [A1, C1] as a query, these two sets should be returned as a result:

[A1, B1, C1, D1], [A1, D3, C1]

Example 2:

Database:

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]

Query:

[Distance to Berlin: 240km]

Result

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]

There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets having common states (such as Gasoline amount: 240) in such a way that the query is as efficient as possible.

What algorithms are there for such needs?

I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.

Clarifications:

  • If it helps answer the question, I'm intending on using them for storing states: Simple example: [Has milk, Doesn't have eggs, Has Sugar]
  • I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure

Conclusion:

I've implemented the two algorithms proposed in the answers, Set-Trie and Inverted Index, and did some rudimentary profiling on them. The original post illustrated the duration of a query for a given set for each algorithm (chart not reproduced here). Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost equivalent) performance-wise.
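The inverted-index implementation that was profiled is not shown in the post, so the following is only a rough sketch of how such an index is commonly built in Python: each element maps to the ids of the sets containing it, and a query intersects those id lists. The class and method names (`InvertedIndex`, `add`, `supersets_of`) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Set


class InvertedIndex:
    """Maps each element to the ids of the stored sets that contain it."""

    def __init__(self) -> None:
        self._sets: List[frozenset] = []
        self._postings: Dict[Hashable, Set[int]] = defaultdict(set)

    def add(self, items: Iterable[Hashable]) -> int:
        """Store one set and return its id."""
        set_id = len(self._sets)
        stored = frozenset(items)
        self._sets.append(stored)
        for item in stored:
            self._postings[item].add(set_id)
        return set_id

    def supersets_of(self, query: Iterable[Hashable]) -> List[frozenset]:
        """Return all stored sets that contain every element of the query."""
        wanted = set(query)
        if not wanted:
            return list(self._sets)
        # .get avoids creating empty entries for elements never stored.
        postings = [self._postings.get(item, set()) for item in wanted]
        postings.sort(key=len)  # start from the rarest element
        ids = set(postings[0])
        for posting in postings[1:]:
            ids &= posting
            if not ids:
                break
        return [self._sets[i] for i in sorted(ids)]


index = InvertedIndex()
for entry in (["A1", "B1", "C1", "D1"], ["A2", "B2", "C1"],
              ["A3", "D3"], ["A1", "D3", "C1"]):
    index.add(entry)

result = index.supersets_of(["A1", "C1"])
for found in result:
    print(sorted(found))
```

Sorting the id lists by length before intersecting is a common design choice: the intermediate result can never be larger than the shortest list.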

Solution

I'm confident that I can now contribute to the solution. One possible, quite efficient way is a trie, a structure that Franklin Mark Liang applied to hyphenation patterns.

Such a special tree is used, for example, in spell checking or autocompletion, and it actually comes close to your desired behavior, especially because it allows searching for subsets quite conveniently.

The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.

What is a Set-Trie? A tree where each node except the root contains a single attribute value (a number) and a marker (a bool) that is set if a data entry ends at this node. Each subtree contains only attributes whose values are larger than the attribute value of its parent node. The root of the Set-Trie is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all marked nodes that you reach when you go down the tree and advance through the search key simultaneously (see below).

But first a drawing by me:

The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.

Please note that the right subtree from root does not contain attribute 1 at all. That's the clue.
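The tree described above can also be rebuilt as text. The sketch below is one possible Python representation, not the code from Savnik's paper; `SetTrieNode`, `insert` and `render` are hypothetical names. It stores the five example sets and prints the tree with `(o)` standing in for the circles.

```python
from typing import Dict, Iterable, List


class SetTrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "SetTrieNode"] = {}
        self.is_end = False  # the marker ("circle"): a stored set ends here


def insert(root: SetTrieNode, items: Iterable[int]) -> None:
    """Store one set as a path of ascending attribute values."""
    node = root
    for value in sorted(set(items)):
        if value not in node.children:
            node.children[value] = SetTrieNode()
        node = node.children[value]
    node.is_end = True


def render(node: SetTrieNode, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, children in ascending order."""
    lines: List[str] = []
    for value in sorted(node.children):
        child = node.children[value]
        mark = " (o)" if child.is_end else ""
        lines.append("  " * depth + str(value) + mark)
        lines.extend(render(child, depth + 1))
    return lines


root = SetTrieNode()
for data in ({1, 2, 4}, {1, 3}, {1, 4}, {2, 3, 5}, {2, 4}):
    insert(root, data)

tree_lines = render(root)
print("\n".join(tree_lines))
```

In the printed tree the subtree under the top-level 2 contains no 1, which is the property pointed out above.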

Searching including subsets:

Say you want to search for attributes 4 and 1. First you order them, so the search key is {1,4}. Now, starting from the root, you go simultaneously through the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. Inside it you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than or equal to 4, which is all of them. You continue until there is nothing left to do and collect all circles (data entries) whose attribute value is exactly 4 (in general, the last attribute in the key). These are {1,2,4} and {1,4}, but not {1,3} (no 4) or {2,4} (no 1).
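The search walk can be sketched in Python as follows. This is an illustration under assumptions, not Savnik's published algorithm: `SetTrieNode`, `insert` and `supersets` are hypothetical names. One deliberate addition: once the whole key has been matched, the sketch also collects marked nodes further down that branch, since a set such as {1,4,6} would contain {1,4} as well.

```python
from typing import Dict, Iterable, List


class SetTrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "SetTrieNode"] = {}
        self.is_end = False  # True if a stored set ends at this node


def insert(root: SetTrieNode, items: Iterable[int]) -> None:
    node = root
    for value in sorted(set(items)):
        if value not in node.children:
            node.children[value] = SetTrieNode()
        node = node.children[value]
    node.is_end = True


def supersets(root: SetTrieNode, query: Iterable[int]) -> List[List[int]]:
    """Return all stored sets that contain every attribute of the query."""
    key = sorted(set(query))
    results: List[List[int]] = []

    def walk(node: SetTrieNode, pos: int, path: List[int]) -> None:
        if pos == len(key):
            # Whole key matched: this node and all marked nodes below qualify.
            if node.is_end:
                results.append(list(path))
            for value in sorted(node.children):
                path.append(value)
                walk(node.children[value], pos, path)
                path.pop()
            return
        target = key[pos]
        for value in sorted(node.children):
            if value > target:
                break  # deeper values only grow, so target cannot appear
            next_pos = pos + 1 if value == target else pos
            path.append(value)
            walk(node.children[value], next_pos, path)
            path.pop()

    walk(root, 0, [])
    return results


root = SetTrieNode()
for data in ({1, 2, 4}, {1, 3}, {1, 4}, {2, 3, 5}, {2, 4}):
    insert(root, data)

found = supersets(root, {4, 1})
print(found)
```

The early `break` is where the ordering pays off: a child whose value already exceeds the attribute being looked for can be skipped together with its whole subtree.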

Insertion:

Insertion is very easy. Go down the tree and store a data entry at the appropriate position. For example, the data entry {2,5} would be stored as a child of {2}.
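A minimal Python sketch of that insertion step, assuming the same kind of node as described earlier (`SetTrieNode` and `insert` are hypothetical names): the attributes are sorted, existing nodes along the path are reused, and only the final node gets the marker.

```python
from typing import Dict, Iterable


class SetTrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "SetTrieNode"] = {}
        self.is_end = False  # True if a stored set ends at this node


def insert(root: SetTrieNode, items: Iterable[int]) -> bool:
    """Store one set; return True if it was not stored before."""
    node = root
    for value in sorted(set(items)):
        child = node.children.get(value)
        if child is None:
            # No path for this attribute yet: create it on the way down.
            child = SetTrieNode()
            node.children[value] = child
        node = child
    was_new = not node.is_end
    node.is_end = True
    return was_new


root = SetTrieNode()
insert(root, {2, 4})
added = insert(root, {5, 2})   # input order does not matter
again = insert(root, {2, 5})   # same set, nothing new is stored

children_of_2 = sorted(root.children[2].children)
print(added, again, children_of_2)
```

Here {2,5} reuses the existing node for 2 and only adds a new child 5 next to the 4 from {2,4}.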

Adding attributes dynamically:

This is naturally supported; you could immediately insert {1,4,6}. It would come below {1,4}, of course.

I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They probably are very efficient.

I don't know if you still want to store the data in a database. I think this would complicate things further, and I don't know what would be best to do in that case.
