搜索的名字在C#中的列表的最快方法 [英] Fastest way to search a list of names in C#

查看:115
本文介绍了搜索的名字在C#中的列表的最快方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在内存中可能有10万名串在我的应用程序的列表。我需要找到包含特定关键字(不区分大小写)排名前20位的字符串。这是很容易做到的,我只是运行下面的LINQ。

I have a list of perhaps 100,000 strings in memory in my application. I need to find the top 20 strings that contain a certain keyword (case insensitive). That's easy to do, I just run the following LINQ.

from s in stringList
where s.ToLower().Contains(searchWord.ToLower())
select s

不过,我有一个独特的觉得我能做到这一点更快,我需要找到办法,因为我需要每秒多次查找在此列表中。

However, I have a distinct feeling that I could do this much faster and I need to find the way to that, because I need to look up in this list multiple times per second.

推荐答案

另一种选择,虽然这将需要的存储器相当数量的,将预先计算类似的后缀阵列(字符串内的位置的列表,由它们指向的字符串排序)。

Another option, although it would require a fair amount of memory, would be to precompute something like a suffix array (a list of positions within the strings, sorted by the strings to which they point).

http://en.wikipedia.org/维基/ Suffix_array

这将是最可行的,如果你对搜索字符串列表是相对静态的。串的索引的整个列表可以存储在元组(indexOfString,positionInString)的单个阵列,在其上需要执行二进制搜索,使用的String.Compare(关键字,0,目标,targetPos,关键字。长度)

This would be most feasible if the list of strings you're searching against is relatively static. The entire list of string indexes could be stored in a single array of tuples(indexOfString, positionInString), upon which you would perform a binary search, using String.Compare(keyword, 0, target, targetPos, keyword.Length).

所以,如果你有20的平均长度为100000的字符串,则需要10万* 20 * 2 * sizeof的(INT)的存储器的结构。你可以减少,在一半通过在最低12位与positionInString包装既indexOfString和positionInString成单个32位的int,例如,和在剩余的高位比特的indexOfString。你只需要做一点点摆弄,以获得两个值退了出去。值得注意的是,结构不包含字符串或子本身是很重要的。你对搜索字符串只出现一次。

So if you had 100,000 strings of average 20 length, you would need 100,000 * 20 * 2*sizeof(int) of memory for the structure. You could cut that in half by packing both indexOfString and positionInString into a single 32 bit int, for example with positionInString in the lowest 12 bits, and the indexOfString in the remaining upper bits. You'd just have to do a little bit fiddling to get the two values back out. It's important to note that the structure contains no strings or substrings itself. The strings you're searching against exist only once.

这将基本上给你一个完整的索引,并允许在索引中的后缀找到任何子很快(二进制搜索数组代表),以最小的实际字符串比较。

This would basically give you a complete index, and allow finding any substring very quickly (binary search over the index the suffix array represents), with a minimum of actual string comparisons.

如果内存是亲爱的,原始的蛮力算法的一个简单的优化将是预先计算的独特字符的字典,并分配序数来表示每个。然后预先计算每个字符串为包含在字符串中的每个字符的独特设置位的位阵列。由于您的字符串是比较短的,应该有resuting BitArrays的变化相当数量的(如果你的字符串是非常长的,将无法正常工作)。然后,您只需计算BitArray或关键字搜索,并且只搜索在这些字符串,其中 keywordBits和放大器的关键字; targetBits == keywordBits 。如果你的字符串预变换为小写,而仅仅是个英文字母,该BitArray可能会一个int中适合。因此,这将至少需要额外的内存,可以实现简单,并允许您快速筛选出的字符串中,你肯定不会找到关键字。这可能是一个非常有用的优化,因为字符串搜索速度快,但你有这么多的人使用蛮力搜索的事情。

If memory is dear, a simple optimization of the original brute force algorithm would be to precompute a dictionary of unique chars, and assign ordinal numbers to represent each. Then precompute a bit array for each string with the bits set for each unique char contained within the string. Since your strings are relatively short, there should be a fair amount of variability of the resuting BitArrays (it wouldn't work well if your strings were very long). You then simply compute the BitArray or your search keyword, and only search for the keyword in those strings where keywordBits & targetBits == keywordBits. If your strings are preconverted to lower case, and are just the English alphabet, the BitArray would likely fit within a single int. So this would require a minimum of additional memory, be simple to implement, and would allow you to quickly filter out strings within which you will definitely not find the keyword. This might be a useful optimization since string searches are fast, but you have so many of them to do using the brute force search.

修改对于那些有兴趣,这里是一个基本的实现我提出的初步解决方案。我跑了使用10万随机生成由OP描述长度字符串测试。虽然花了30秒左右,构建和排序的指标,一旦做出的搜索关键字3000倍的速度是49805毫秒蛮力,并使用索引搜索18毫秒,所以快一对夫妇一千次。如果你很少生成列表,那么我的简单,但最初建造的后缀数组应该足以比较慢的方法。还有更聪明的方法来构建它的速度更快,但需要比我下面基本实现更多的代码。

EDIT For those interested, here is a basic implementation of the initial solution I proposed. I ran tests using 100,000 randomly generated strings of lengths described by the OP. Although it took around 30 seconds to construct and sort the index, once made, the speed of searching for keywords 3000 times was 49,805 milliseconds for brute force, and 18 milliseconds using the indexed search, so a couple thousand times faster. If you rarely build the list, then my simple, but relatively slow method of initially building the suffix array should be sufficient. There are smarter ways to build it that are faster, but would require more coding than my basic implementation below.

// little test console app
static void Main(string[] args) {
    var list = new SearchStringList(true);
    list.Add("Now is the time");
    list.Add("for all good men");
    list.Add("Time now for something");
    list.Add("something completely different");
    while (true) {
        string keyword = Console.ReadLine();
        if (keyword.Length == 0) break;
        foreach (var pos in list.FindAll(keyword)) {
            Console.WriteLine(pos.ToString() + " =>" + list[pos.ListIndex]);
        }
    }
}
~~~~~~~~~~~~~~~~~~
// file for the class that implements a simple suffix array
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections;

namespace ConsoleApplication1 {
    public class SearchStringList {
        private List<string> strings = new List<string>();
        private List<StringPosition> positions = new List<StringPosition>();
        private bool dirty = false;
        private readonly bool ignoreCase = true;

        public SearchStringList(bool ignoreCase) {
            this.ignoreCase = ignoreCase;
        }

        public void Add(string s) {
            if (s.Length > 255) throw new ArgumentOutOfRangeException("string too big.");
            this.strings.Add(s);
            this.dirty = true;
            for (byte i = 0; i < s.Length; i++) this.positions.Add(new StringPosition(strings.Count-1, i));
        }

        public string this[int index] { get { return this.strings[index]; } }

        public void EnsureSorted() {
            if (dirty) {
                this.positions.Sort(Compare);
                this.dirty = false;
            }
        }

        public IEnumerable<StringPosition> FindAll(string keyword) {
            var idx = IndexOf(keyword);
            while ((idx >= 0) && (idx < this.positions.Count)
                && (Compare(keyword, this.positions[idx]) == 0)) {
                yield return this.positions[idx];
                idx++;
            }
        }

        private int IndexOf(string keyword) {
            EnsureSorted();

            // binary search
            // When the keyword appears multiple times, this should
            // point to the first match in positions. The following
            // positions could be examined for additional matches
            int minP = 0;
            int maxP = this.positions.Count - 1;
            while (maxP > minP) {
                int midP = minP + ((maxP - minP) / 2);
                if (Compare(keyword, this.positions[midP]) > 0) {
                    minP = midP + 1;
                } else {
                    maxP = midP;
                }
            }
            if ((maxP == minP) && (Compare(keyword, this.positions[minP]) == 0)) {
                return minP;
            } else {
                return -1;
            }
        }

        private int Compare(StringPosition pos1, StringPosition pos2) {
            int len = Math.Max(this.strings[pos1.ListIndex].Length - pos1.StringIndex, this.strings[pos2.ListIndex].Length - pos2.StringIndex);
            return String.Compare(strings[pos1.ListIndex], pos1.StringIndex, this.strings[pos2.ListIndex], pos2.StringIndex, len, ignoreCase);
        }

        private int Compare(string keyword, StringPosition pos2) {
            return String.Compare(keyword, 0, this.strings[pos2.ListIndex], pos2.StringIndex, keyword.Length, this.ignoreCase);
        }

        // Packs index of string, and position within string into a single int. This is
        // set up for strings no greater than 255 bytes. If longer strings are desired,
        // the code for the constructor, and extracting  ListIndex and StringIndex would
        // need to be modified accordingly, taking bits from ListIndex and using them
        // for StringIndex.
        public struct StringPosition {
            public static StringPosition NotFound = new StringPosition(-1, 0);
            private readonly int position;
            public StringPosition(int listIndex, byte stringIndex) {
                this.position = (listIndex < 0) ? -1 : this.position = (listIndex << 8) | stringIndex;
            }
            public int ListIndex { get { return (this.position >= 0) ? (this.position >> 8) : -1; } }
            public byte StringIndex { get { return (byte) (this.position & 0xFF); } }
            public override string ToString() {
                return ListIndex.ToString() + ":" + StringIndex;
            }
        }
    }
}

这篇关于搜索的名字在C#中的列表的最快方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆