索引搜索子集的数据结构 [英] Data structure for indexed searches of subsets

查看:138
本文介绍了索引搜索子集的数据结构的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在开发 c#jquery实施,并试图找出一种高效的定位算法整个DOM的子集中的元素(例如子选择器)。目前,我正在创建一个常用选择器的索引:构建DOM时的类,标识和标签。

I'm working on a c# jquery implementation and am trying to figure out an efficient algorithm for locating elements in a subset of the entire DOM (e.g. a subselector). At present I am creating an index of common selectors: class, id, and tag when the DOM is built.

基本的数据结构是可以预期的,一个树元素,其中包含 IEnumerable< Element>儿童父母。当使用 Dictonary< string,HashSet< Element>> 搜索整个域时,这很简单,用于存储索引。

The basic data structure is as one would expect, a tree of Elements which contain IEnumerable<Element> Children and a Parent. This is simple when searching the whole domain using a Dictonary<string,HashSet<Element>> to store the index.

我无法让我的头脑围绕使用索引搜索元素的子集的最有效的方式。我使用术语子集来指代起始集合,链从后面的选择器将被运行。以下是我想到的方法:

I have not been able to get my head around the most effective way to search subsets of elements using an index. I use the term "subset" to refer to the starting set from which a subsequent selector in a chain will be run against. The following are methods I've thought of:


  1. 从整个DOM中检索匹配项以进行子查询,并消除不属于子集。这需要遍历每个匹配的父母,直到找到根(并且被消除)或者发现子集的成员(并且它是一个小孩,因此包括在内)

  2. 维护每个元素的索引分开。

  3. 为每个元素维护一组父类(通过消除遍历来使#1快速)

  4. 重建整个每个子查询的索引。

  5. 只需搜索主要选择器即可手动搜索。

  1. Retrieve matches from entire DOM for a subquery, and eliminate those that are not part of the subset. This requires traversing up the parents of each match until the root is found (and it is eliminated) or a member of the subset is found (and it is a child, hence included)
  2. Maintain the index separately for each element.
  3. Maintain a set of parents for each element (to make #1 fast by eliminating traversal)
  4. Rebuild the entire index for each subquery.
  5. Just search manually except for primary selectors.

每个可能的技术很大程度上取决于正在完成的操作。大多数时候,#1可能相当不错,因为大多数时候,当您进行子选择时,您将针对特定元素。所需的迭代次数将是结果数*每个元素的平均深度。

The cost of each possible technique depends greatly on the exact operation being done. #1 is probably pretty good most of the time, since most of the time when you do a sub-select, you're targeting specific elements. The number of iterations required would be the number of results * the average depth of each element.

第二种方法是迄今为止最快的选择,但费用的存储需求随深度呈指数增长,索引维护困难。我几乎已经消除了这一点。

The 2nd method would be by far the fastest for selecting, but at the expense of storage requirements that increase exponentially with depth, and difficult index maintenance. I've pretty much eliminated this.

第三种方法具有相当差的内存占用(尽管比#2好多了) - 这可能是合理的,但除了存储要求,添加和删除元素变得更加昂贵和复杂。

The 3rd method has a fairly bad memory footprint (though much better than #2) - it may be reasonable, but in addition to the storage requirements, adding and removing elements becomes substantially more expensive and complicated.

第4种方法需要遍历整个选择,所以看起来毫无意义,因为大多数子查询只能跑一次如果预计会重复一个副手,这将是有益的。 (或者,我也可以在遍历子集时执行此操作 - 除了一些选择器不需要搜索整个子域,例如ID和位置选择器)。

The 4rd method requires traversing the entire selection anyway so it seems pointless since most subqueries are only going to be run once. It would only be beneficial if a subequery was expected to be repeated. (Alternatively, I could just do this while traversing a subset anyway - except some selectors don't require searching the whole subdomain, e.g. ID and position selectors).

第5方法对于有限的子集将是适用的,但比DOM的大部分子集的第一种方法差得多。

The 5th method will be fine for limited subsets, but much worse than the 1st method for subsets that are much of the DOM.

有关如何最好地完成此任务的任何想法或其他想法?我可以做一些#1和#4的混合,猜测哪个更有效,因为被搜索的子集的大小与DOM的大小,但这是非常模糊的,我宁愿找到一些通用的解决方案。现在我只是使用#4(只有全DOM查询使用索引)这是很好的,但真的很糟糕,如果你决定做一些像 $('body')。Find('#id ')

Any thoughts or other ideas about how best to accomplish this? I could do some hybrid of #1 and #4 by guessing which is more efficient given the size of the subset being searched vs. the size of the DOM but this is pretty fuzzy and I'd rather find some universal solution. Right now I am just using #4 (only full-DOM queries use the index) which is fine, but really bad if you decided to do something like $('body').Find('#id')

免责声明:这是早期优化。我没有需要解决的瓶颈,但作为一个学术问题,我不能停止思考...

Disclaimer: This is early optimization. I don't have a bottleneck that needs solving, but as an academic problem I can't stop thinking about it...

解决方案

这是答案中提出的数据结构的实现。作为一个字典的附近替代品,它的工作完美。

Here's the implementation for the data structure as proposed by the answer. Is working perfectly as a near drop-in replacement for a dictionary.

interface IRangeSortedDictionary<TValue>: IDictionary<string, TValue>
{
    IEnumerable<string> GetRangeKeys(string subKey);
    IEnumerable<TValue> GetRange(string subKey);

}
public class RangeSortedDictionary<TValue> : IRangeSortedDictionary<TValue>
{
    protected SortedSet<string> Keys = new SortedSet<string>();
    protected Dictionary<string,TValue> Index = 
        new Dictionary<string,TValue>();
    public IEnumerable<string> GetRangeKeys(string subkey)
    {
        if (string.IsNullOrEmpty(subkey)) {
            yield break;
        }
        // create the next possible string match
        string lastKey = subkey.Substring(0,subkey.Length - 1) +
            Convert.ToChar(Convert.ToInt32(subkey[subkey.Length - 1]) + 1);

        foreach (var key in Keys.GetViewBetween(subkey, lastKey))
        {
            // GetViewBetween is inclusive, exclude the last key just in case
            // there's one with the next value
            if (key != lastKey)
            {
                yield return key;
            }
        }
    }

    public IEnumerable<TValue> GetRange(string subKey)
    {
        foreach (var key in GetRangeKeys(subKey))
        {
            yield return Index[key];
        }
    }
    // implement dictionary interface against internal collections
}

代码在这里: http://ideone.com/UIp9R

推荐答案

如果您怀疑姓名冲突是不常见的,可能足够快,只能走上树。

If you suspect name collisions will be uncommon, it may be fast enough to just walk up the tree.

如果碰撞是常见的,那么使用在有序前缀搜索上优化的数据结构(如树)可能会更快。您的各种子集组成前缀。您的索引键将包含选择器和总路径。

If collisions are common though, it might be faster to use a data structure that excels at ordered prefix searches, such as a tree. Your various subsets make up the prefix. Your index keys would then include both selectors and total paths.

对于DOM:

<path>
  <to>
    <element id="someid" class="someclass" someattribute="1"/>
  </to>
</path>

您将拥有以下索引键:

<element>/path/to/element
#someid>/path/to/element
.someclass>/path/to/element
@someattribute>/path/to/element

现在,如果您根据前缀搜索这些密钥,那么可以将查询限制到您想要的任何子集:

Now if you search these keys based on prefix, you can limit the query to any subset you want:

<element>           ; finds all <element>, regardless of path
.someclass>         ; finds all .someclass, regardless of path
.someclass>/path    ; finds all .someclass that exist in the subset /path
.someclass>/path/to ; finds all .someclass that exist in the subset /path/to
#id>/body           ; finds all #id that exist in the subset /body

一棵树可以找到下限元素> =您的搜索值)在 O (log n )中,并且因为它是从那里排序的,只需重复一遍,直到你不再匹配的密钥字首。这将非常快!

A tree can find the lower bound (the first element >= to your search value) in O(log n), and because it is ordered from there you simply iterate until you come to a key that no longer matches the prefix. It will be very fast!

.NET没有合适的树结构(它有SortedDictionary,但不幸的是没有暴露所需的 LowerBound 方法),因此您需要自己编写或使用现有的第三方。优秀的 C5 Generic Collection Library 具有适合 Range 方法。

.NET doesn't have a suitable tree structure (it has SortedDictionary but that unfortunately doesn't expose the required LowerBound method), so you'll need to either write your own or use an existing third party one. The excellent C5 Generic Collection Library features trees with suitable Range methods.

这篇关于索引搜索子集的数据结构的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆