How to implement autocomplete on a massive dataset


Problem description


I'm trying to implement something like Google Suggest on a web site I am building, and I'm curious how to go about doing it on a very large data set. Sure, if you've got 1,000 items you cache the items and just loop through them. But how do you go about it when you have a million items? Further, suppose the items are not single words. Specifically, I have been really impressed by Pandora.com. For example, if you search for "wet" it brings back "Wet Sand", but it also brings back "Toad The Wet Sprocket". And their autocomplete is FAST. My first idea was to group the items by their first two letters, so you would have something like:

Dictionary<string,List<string>>
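A minimal sketch of this two-letter bucketing idea in C#; the helper names BuildIndex and Lookup are illustrative, not part of any library:

using System;
using System.Collections.Generic;
using System.Linq;

static class PrefixIndex
{
    // Group items by their first two letters, lower-cased.
    public static Dictionary<string, List<string>> BuildIndex(IEnumerable<string> items)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (var item in items)
        {
            if (item.Length < 2) continue;               // skip items too short to bucket
            var key = item.Substring(0, 2).ToLowerInvariant();
            if (!index.TryGetValue(key, out var bucket))
                index[key] = bucket = new List<string>();
            bucket.Add(item);
        }
        return index;
    }

    // Return every item in the query's bucket that actually starts with the query.
    public static IEnumerable<string> Lookup(Dictionary<string, List<string>> index, string query)
    {
        if (query.Length < 2) return Enumerable.Empty<string>();
        var key = query.Substring(0, 2).ToLowerInvariant();
        return index.TryGetValue(key, out var bucket)
            ? bucket.Where(i => i.StartsWith(query, StringComparison.OrdinalIgnoreCase))
            : Enumerable.Empty<string>();
    }
}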


where the key is the first two letters. That's OK, but what if I want to do something similar to Pandora and allow the user to see results that match the middle of the string? With my idea, "wet" would never match "Toad the Wet Sprocket" because it would be in the "TO" bucket instead of the "WE" bucket. So then perhaps you could split the string up and put "Toad the Wet Sprocket" in the "TO", "WE" and "SP" buckets (stripping out the word "the"). But when you're talking about a million entries, each of which may have a few words, it seems like you'd quickly start using up a lot of memory. OK, that was a long question. Thoughts?
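A sketch of that per-word variant, under the same assumptions (the WordPrefixIndex name and the stop-word list are illustrative): bucket each word's first two letters and map every bucket entry back to the full title, so one title can appear in several buckets.

using System;
using System.Collections.Generic;

static class WordPrefixIndex
{
    static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "the", "a", "an" };

    public static Dictionary<string, List<string>> BuildIndex(IEnumerable<string> items)
    {
        var index = new Dictionary<string, List<string>>();
        foreach (var item in items)
        {
            foreach (var word in item.Split(' '))
            {
                if (word.Length < 2 || StopWords.Contains(word)) continue;
                var key = word.Substring(0, 2).ToLowerInvariant();
                if (!index.TryGetValue(key, out var bucket))
                    index[key] = bucket = new List<string>();
                bucket.Add(item);    // the same title may land in several buckets
            }
        }
        return index;
    }
}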

Recommended answer


As I pointed out in How to implement incremental search on a list (http://stackoverflow.com/questions/658235/how-to-implement-incremental-search-on-a-list/658336#658336), you should use a structure like a trie or Patricia trie (http://en.wikipedia.org/wiki/Radix_tree) for searching patterns in large texts.
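A minimal trie sketch in C#, for illustration only; the answer does not prescribe this exact layout. One node per character, with the matching items recorded along the path so a prefix walk can report them directly:

using System.Collections.Generic;

class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    // Items whose text passes through this node; used to report matches.
    public readonly HashSet<string> Items = new HashSet<string>();
}

class Trie
{
    public readonly TrieNode Root = new TrieNode();

    public void Insert(string text, string item)
    {
        var node = Root;
        foreach (var c in text.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            child.Items.Add(item);   // record the item along the whole path
            node = child;
        }
    }

    // Walk down one node per query character; at most |query| steps.
    public IEnumerable<string> Search(string query)
    {
        var node = Root;
        foreach (var c in query.ToLowerInvariant())
            if (!node.Children.TryGetValue(c, out node))
                return new string[0];
        return node.Items;
    }
}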


And for discovering patterns in the middle of some text there is one simple solution. I am not sure whether it is the most efficient solution, but I usually do it as follows.


When I insert some new text into the trie, I just insert it, then remove the first character and insert again, remove the second character and insert again, and so on until the whole text is consumed. Then you can discover every substring of every inserted text with just one search from the root. The resulting structure is called a suffix tree, and there are a lot of optimizations available.
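A sketch of that suffix-insertion trick, reusing the Trie type from the previous sketch; the InsertAllSuffixes helper name is illustrative:

static class SuffixIndexing
{
    // Insert every suffix of the item, so a single walk from the root
    // finds matches anywhere in the string, not just at the start.
    public static void InsertAllSuffixes(Trie trie, string item)
    {
        for (int i = 0; i < item.Length; i++)
            trie.Insert(item.Substring(i), item);  // "wet sand", "et sand", "t sand", ...
    }
}

// Usage: a query for "wet" now reaches both titles.
// var trie = new Trie();
// SuffixIndexing.InsertAllSuffixes(trie, "Wet Sand");
// SuffixIndexing.InsertAllSuffixes(trie, "Toad the Wet Sprocket");
// trie.Search("wet");  // -> { "Wet Sand", "Toad the Wet Sprocket" }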


And it is really incredibly fast. To find all texts that contain a given sequence of n characters, you have to inspect at most n nodes and perform a search on the list of children for each node. Depending on the implementation of the child-node collection (array, list, binary tree, skip list), you might be able to identify the required child node with as few as 5 search steps, assuming case-insensitive Latin letters only. Interpolation search might be helpful for large alphabets and for nodes with many children, as are usually found near the root.
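As an illustration of that child-lookup cost (node and field names here are hypothetical): with the children kept in a sorted array, a binary search over a 26-letter alphabet needs at most ceil(log2 26) = 5 comparisons to find, or rule out, the next node.

using System;

class ArrayTrieNode
{
    public char[] Keys = new char[0];            // sorted child characters
    public ArrayTrieNode[] Nodes = new ArrayTrieNode[0];

    public ArrayTrieNode Child(char c)
    {
        int i = Array.BinarySearch(Keys, c);     // at most 5 steps for a-z
        return i >= 0 ? Nodes[i] : null;
    }
}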
