Creating a "spell check" that checks against a database with a reasonable runtime


Question

I'm not asking about implementing the spell check algorithm itself. I have a database that contains hundreds of thousands of records. What I am looking to do is check a user input against a certain column in a table for all these records and return any matches within a certain Hamming distance (again, this question is not about determining Hamming distance, etc.). The purpose, of course, is to create a "did you mean" feature, where a user searches a name, and if no direct matches are found in the database, a list of possible matches is returned.

I'm trying to come up with a way to do all of these checks in the most reasonable runtime possible. How can I check a user's input against all of these records in the most efficient way possible?

The feature is currently implemented, but the runtime is exceedingly slow. The way it works now is it loads all records from a user-specified table (or tables) into memory and then performs the check.

For what it's worth, I'm using NHibernate for data access.

I would appreciate any feedback on how I can do this or what my options are.

Answer

Calculating Levenshtein distance doesn't have to be as costly as you might think. The code in the Norvig article can be thought of as pseudocode to help the reader understand the algorithm. A much more efficient implementation (in my case, approx. 300 times faster on a 20,000-term data set) is to walk a trie. The performance difference is mostly attributable to removing the need to allocate millions of strings in order to do dictionary lookups, spending much less time in the GC, and better locality of reference, so you get fewer CPU cache misses. With this approach I am able to do lookups in around 2 ms on my web server. An added bonus is the ability to easily return all results that start with the provided string.
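The trie walk described above can be sketched as follows. This is a minimal illustration in Python (the answerer's actual stack is C#/NHibernate; the names `TrieNode`, `build_trie`, and `search` are hypothetical). Each step down the trie computes one row of the Levenshtein matrix, shared across every term with that prefix, and a branch is pruned as soon as no cell in its row can stay within the distance budget:

```python
class TrieNode:
    __slots__ = ("children", "word")

    def __init__(self):
        self.children = {}
        self.word = None  # set to the full term at nodes that end a term


def build_trie(terms):
    """Build a trie over the dictionary terms (slow; do it once and reuse)."""
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.word = term
    return root


def search(root, query, max_distance):
    """Return (term, distance) pairs within max_distance of query."""
    results = []
    # First row of the Levenshtein matrix: distance from the empty prefix.
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, results, max_distance)
    return results


def _walk(node, ch, query, prev_row, results, max_distance):
    # Compute the matrix row for the prefix ending in ch.
    current_row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        insert_cost = current_row[col - 1] + 1
        delete_cost = prev_row[col] + 1
        replace_cost = prev_row[col - 1] + (query[col - 1] != ch)
        current_row.append(min(insert_cost, delete_cost, replace_cost))

    if node.word is not None and current_row[-1] <= max_distance:
        results.append((node.word, current_row[-1]))

    # Prune: descend only if some cell could still lead to a match.
    if min(current_row) <= max_distance:
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, current_row, results, max_distance)


root = build_trie(["smith", "smyth", "smythe", "jones"])
print(sorted(search(root, "smith", 1)))  # [('smith', 0), ('smyth', 1)]
```

The pruning step is where the speedup comes from: a whole subtree of terms is abandoned with a single row comparison, instead of running a full distance computation per term.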

The downside is that creating the trie is slow (can take a second or so), so if the source data changes regularly then you need to decide whether to rebuild the whole thing or apply deltas. At any rate, you want to reuse the structure as much as possible once it's built.
