尽可能快地对已知键集进行字符串键查找 [英] Fastest possible string key lookup for known set of keys

查看:149
本文介绍了尽可能快地对已知键集进行字符串键查找的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑一个带有以下签名的查找函数,它需要返回给定字符串键的整数:

  int GetValue (字符串键){...} 

进一步考虑键值映射,编号N,在编写函数源代码时预先知道,例如:

  // N = 3 
{ foo,1},
{bar,42},
{bazz,314159}

因此,上面输入的函数的一个有效(但不是完美的!)实现将是:

  {
switch(key)
{
casefoo:return 1;
casebar:return 42;
casebazz:return 314159;
}

//不管我们在这里做什么,控制都不会来到这一点
抛出新的Exception();
}

事先知道多少次(C> = 1)该函数将在运行时为每个给定的键调用。例如:

  C [foo] = 1; 
C [bar] = 1;
C [bazz] = 2;

然而,这种调用的顺序并不是已知的。例如。上面的代码可以在运行时描述下列调用序列:

  GetValue(foo); 
GetValue(bazz);
GetValue(bar);
GetValue(bazz);

或任何其他顺序,只要通话次数匹配即可。



还有一个限制M,以任何单位最方便的方式指定,定义可由 GetValue (结构被预先初始化;初始化不计入函数的复杂性)。例如,M = 100个字符,或者M = 256个sizeof(对象引用)。​​

问题是,如何编写 GetValue ,以便尽可能快 - 换句话说,所有 GetValue 调用的总时间(请注意,我们知道总计数对于给定的N,C和M,这个算法可能需要一个合理的最小值,例如,对于给定的N,C和M? M> = char.MaxValue 。它也可能要求M与某个合理的边界对齐 - 例如,它可能只是2的幂。它也可能要求M必须是某种N的函数(例如,它可以允许有效的M = N或M = 2N,...;或有效的M = N或M = N ^ 2, ...;等)。



算法可以用任何合适的语言或其他形式表达。对于生成的代码的运行时性能限制,假定生成的代码 GetValue 将使用C#,VB或Java(实际上,任何语言都可以,只要字符串被处理作为不可变的字符数组 - 即O(1)长度和O(1)索引,并且没有提前为其计算其他数据)。此外,为了简化这一点,假设对于所有密钥C = 1的答案被认为是有效的,尽管那些覆盖更一般情况的答案是优选的。

一些对可能的方法进行思考



上述明显的第一个答案是使用完美哈希,但是找到一个方法的通用方法似乎并不完美。例如,对于上面的示例数据,可以使用Pearson散列来轻松生成最小完美散列表,但是对于每次调用 GetValue ,并且Pearson散列必然会扫描整个输入字符串。但是所有示例键的第三个字符实际上是不同的,所以只有它可以用作散列的输入而不是整个字符串。此外,如果M要求至少 char.MaxValue ,那么第三个字符本身就是一个完美的散列。



对于不同的一组键,这可能不再是真实的,但仍然可以在给出精确答案之前减少考虑的字符数量。此外,在一些 minimal 完美散列需要检查整个字符串的情况下,可能会减少查找子集,或者使其更快(例如,不太复杂的散列函数?)。通过使散列非最小(即M> N) - 有效地牺牲了空间,以提高速度。



也可能是传统的哈希不是很好的想法开始,并且将 GetValue 的主体作为一系列条件结构更容易,其排列使得首先检查最可变字符(即大多数密钥都不相同),并根据需要进一步嵌套检查以确定正确的答案。请注意,这里的变化可能会受到每个键将被查找的次数的影响(C)。此外,分支的最佳结构应该是什么并不总是很明显 - 例如,它可能是最可变的字符只能让您区分100个中的10个键,而对于其余的90个,则额外的一个检查是不必要区分它们的,平均而言(考虑到C)每个关键字的检查次数多于不同的解决方案,而不是以最大变量字符开始。然后,我们的目标是确定完美的检查顺序。 您可以使用 en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithmrel =nofollow> Boyer 搜索,但我认为 Trie 将是一个更有效的方法。您可以修改Trie,以便在您为关键零点创建命中数时折叠这些单词,从而减少您在搜索范围之外进行的搜索次数。您将得到的最大好处是您正在为索引执行数组查找,这比比较快得多。


Consider a lookup function with the following signature, which needs to return an integer for a given string key:

int GetValue(string key) { ... }

Consider furthermore that the key-value mappings, numbering N, are known in advance when the source code for function is being written, e.g.:

// N=3
{ "foo", 1 },
{ "bar", 42 },
{ "bazz", 314159 }

So a valid (but not perfect!) implementation for the function for the input above would be:

int GetValue(string key)
{
    switch (key)
    {
         case "foo": return 1;
         case "bar": return 42;
         case "bazz": return 314159;
    }

    // Doesn't matter what we do here, control will never come to this point
    throw new Exception();
}

It is also known in advance exactly how many times (C>=1) the function will be called at run-time for every given key. For example:

C["foo"] = 1;
C["bar"] = 1;
C["bazz"] = 2;

The order of such calls is not known, however. E.g. the above could describe the following sequence of calls at run-time:

GetValue("foo");
GetValue("bazz");
GetValue("bar");
GetValue("bazz");

or any other sequence, provided the call counts match.

There is also a restriction M, specified in whatever units is most convenient, defining the upper memory bound of any lookup tables and other helper structures that can be used by the GetValue (the structures are initialized in advance; that initialization is not counted against the complexity of the function). For example, M=100 chars, or M=256 sizeof(object reference).

The question is, how to write the body of GetValue such that it is as fast as possible - in other words, the aggregate time of all GetValue calls (note that we know the total count, per everything above) is minimal, for given N, C and M?

The algorithm may require a reasonable minimal value for M, e.g. M >= char.MaxValue. It may also require that M be aligned to some reasonable boundary - for example, that it may only be a power of two. It may also require that M must be a function of N of a certain kind (for example, it may allow valid M=N, or M=2N, ...; or valid M=N, or M=N^2, ...; etc).

The algorithm can be expressed in any suitable language or other form. For runtime performance constrains for generated code, assume that the generated code for GetValue will be in C#, VB or Java (really, any language will do, so long as strings are treated as immutable arrays of characters - i.e. O(1) length and O(1) indexing, and no other data computed for them in advance). Also, to simplify this a bit, answers which assume that C=1 for all keys are considered valid, though those answers which cover the more general case are preferred.

Some musings on possible approaches

The obvious first answer to the above is using a perfect hash, but generic approaches to finding one seem to be imperfect. For example, one can easily generate a table for a minimal perfect hash using Pearson hashing for the sample data above, but then the input key would have to be hashed for every call to GetValue, and Pearson hash necessarily scans the entire input string. But all sample keys actually differ in their third character, so only that can be used as the input for the hash instead of the entire string. Furthermore, if M is required to be at least char.MaxValue, then the third character itself becomes a perfect hash.

For a different set of keys this may no longer be true, but it may still be possible to reduce the amount of characters considered before the precise answer can be given. Furthermore, in some cases where a minimal perfect hash would require inspecting the entire string, it may be possible to reduce the lookup to a subset, or otherwise make it faster (e.g. a less complex hashing function?) by making the hash non-minimal (i.e. M > N) - effectively sacrificing space for the sake of speed.

It may also be that traditional hashing is not such a good idea to begin with, and it's easier to structure the body of GetValue as a series of conditionals, arranged such that the first checks for the "most variable" character (the one that varies across most keys), with further nested checks as needed to determine the correct answer. Note that "variance" here can be influenced by the number of times each key is going to be looked up (C). Furthermore, it is not always readily obvious what the best structure of branches should be - it may be, for example, that the "most variable" character only lets you distinguish 10 keys out of 100, but for the remaining 90 that one extra check is unnecessary to distinguish between them, and on average (considering C) there are more checks per key than in a different solution which does not start with the "most variable" character. The goal then is to determine the perfect sequence of checks.

解决方案

You could use the Boyer search, but I think that the Trie would be a much more effiecent method. You can modify the Trie to collapse the words as you make the hit count for a key zero, thus reducing the number of searches you would have to do the farther down the line you get. The biggest benefit you would get is that you are doing array lookups for the indexes, which is much faster than a comparison.

这篇关于尽可能快地对已知键集进行字符串键查找的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆