搜索文件并返回值 - 超快 [英] Searching a file and returning value - Super Fast

查看:43
本文介绍了搜索文件并返回值 - 超快的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组数据,其中有一个名称、一些子值和一个关联的数值.例如:

I have a set of data which has a name, some sub values and then a associative numeric value. For example:

James Value1 Value2 "1.232323/1.232334"
Jim   Value1 Value2 "1.245454/1.232999"
Dave  Value1 Value2 "1.267623/1.277777"

将有大约 100,000 个这样的条目存储在文件或数据库中.我想知道,能够返回与搜索匹配的结果及其相关数值的最快方法是什么.

There will be around 100,000 entries like this stored in either a file or database. I would like to know, what is the quickest way of being able to return the results which match a search, along with their associated numeric value.

例如,查询J"将同时返回 James 和 Jim 结果,即最后一列中的数值.

For example, a query of "J" would return both James and Jim results which the numeric values in the last column.

我听说有人提到过二叉树搜索、字典搜索、索引搜索.我不知道哪个是阅读的好方法.

I've heard people mention binary tree searching, dictionary searching, indexed searching. I have no idea which is a good route to peruse.

推荐答案

这是一个特征不佳的问题.与许多优化问题一样,资源也存在权衡.如果您真的想要尽可能快的响应,那么一种可能的方法是将所有可能的搜索编译到一个准备好的结果表中,这样,在给定搜索键的情况下,您可以在表中查找搜索键并返回结果.

This is a poorly characterized problem. As with many optimization problems, there are trade-offs in resources. If you truly want the fastest response possible, then a likely approach is to compile all possible searches into a table of prepared results, so that, given a search key, you can look the search key up in the table and return the result.

假设您的字符集仅限于 A-Z 和 a-z,那么按照今天的标准,每个搜索键的条目从 0 到 4 个字符的表将使用适度的内存量.每个表条目只需要在其中包含两个值: 数值列表中的开始和结束位置.(以这种方式编译列表:按名称字段对记录进行排序.仅从记录中提取数值,保持顺序,将它们放入列表中.任何搜索键都必须返回该列表中连续记录的子列表.这是因为搜索的是名称字段的前缀字符串,所以任何匹配搜索关键字的记录都是相邻的,当按名称字段排序时.)

Assuming your character set is limited to A-Z and a-z, a table with an entry for each search key from 0 to 4 characters will use a modest amount of memory by today’s standards. Each table entry merely needs to have two values in it: The start and end positions in a list of the numeric values. (Compile the list in this way: Sort the records by the name field. Extract just the numeric values from the records, maintaining the order, putting them in a list. Any search key must return a sublist of contiguous records from that list. This is because the search is for a prefix string of the name field, so any records that match the search key are adjacent, when sorted by the name field.)

因此,要创建一个表来查找 0 到 4 个字符的任意键,您需要在对表中少于 534 个条目,其中对的每个成员都包含一个记录号(32 位或更少).所以 8•534 = 60.2 MiB 就足够了.(53 是因为你有 52 个字符加上一个标记字符来标记密钥的结尾.替代编码可以减少一些.)

Thus, to create a table to look up any key of 0 to 4 characters, you need fewer than 534 entries in a table of pairs, where each member of the pair contains a record number (32 bits or fewer). So 8•534 = 60.2 MiB suffices. (53 is because you have 52 characters plus one sentinel character to mark the end of the key. Alternate encodings could reduce this some.)

要支持超过 4 个字符的键,您需要对其进行扩展.对于典型数据,4个字符将大大缩小搜索范围,因此您可以将前4个字符表示的记录集剪枝得到最终结果.如果数据有病态的情况,4个字符并没有减少多少搜索,你可以修饰这个技巧.

To support keys of more than 4 characters, you need to extend this. With typical data, 4 characters will have narrowed down the search greatly, so you can take the set of records indicated by the first 4 characters and prune it to get the final results. If the data has pathological cases where 4 characters does not reduce the search much, you can embellish this technique.

那么,这真的是你想要做的,让速度尽可能快,而不考虑其他资源(包括工程时间)的消耗吗?如果没有,您的实际目标是什么?

So, is that really what you want to do, make the speed as fast as possible, regardless of other resources (including engineering time) consumed? If not, what are your actual goals?

这篇关于搜索文件并返回值 - 超快的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆