How to correct the user input (kind of Google "Did you mean?")

Problem description

I have the following requirement:

I have many (say 1 million) values (names). The user will type a search string.

I don't expect the user to spell the names correctly.

So I want to build something like Google's "Did you mean?", which would list all the possible matches from my datastore. There is a similar, but not identical, question here; it did not answer my question.

My questions:

1) I think it is not advisable to store this data in an RDBMS, because I would have nothing to filter on in the SQL queries and would have to do a full table scan. So how should the data be stored in this situation?

2) The second question is the same as this. But, just for the completeness of my question: how do I search through the large data set? Suppose the name Franky is in the dataset. If a user types Phranky, how do I match it to Franky? Do I have to loop through all the names?

I came across Levenshtein distance, which looks like a good technique for finding the possible strings. But again, do I have to compute it against all 1 million values in my data store?

3) I know Google does this by watching users' behavior. But I want to do it without watching user behavior, i.e. by using, I don't know yet, say distance algorithms, because the former method requires a large volume of searches to start with!

4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:

  • Users mistyping a word (an edit distance algorithm)
  • Users not knowing a word and guessing (a phonetic match algorithm)

I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have a Levenshtein distance of 2, which is a lot for a four-letter name and too high to treat as a simple typo.
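For reference, the edit-distance scenario can be sketched in a few lines of Python. This is only an illustration: the names `levenshtein` and `closest` and the `max_distance` cutoff are assumptions made for this example, and the search is exactly the brute-force scan over every name that the question is hoping to avoid.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def closest(query, names, max_distance=2):
    """Hypothetical helper: scan every name, keep those within max_distance."""
    scored = ((levenshtein(query.lower(), n.lower()), n) for n in names)
    return sorted((d, n) for d, n in scored if d <= max_distance)


print(closest("Phranky", ["Franky", "Frank", "Maria"]))  # [(2, 'Franky')]
```

Each call to `closest` costs one distance computation per stored name, so with a million names some kind of pre-computed key or index is needed to narrow the candidates first.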

Solution

The Soundex algorithm may help you out with this.

http://en.wikipedia.org/wiki/Soundex

You could pre-generate the Soundex value for each name, store it in the database alongside the name, and index that column so that lookups do not have to scan the table.
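A minimal sketch of that idea, assuming Python with an in-memory SQLite database. The table name `names`, the column `sx`, and the helper `suggest` are hypothetical, and `soundex` is a plain implementation of American Soundex rather than any particular database's built-in function.

```python
import sqlite3

# American Soundex digit groups: BFPV=1, CGJKQSXZ=2, DT=3, L=4, MN=5, R=6.
_CODES = {}
for digit, group in enumerate(("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"), start=1):
    for ch in group:
        _CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex code, or "" if there are no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]                 # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")  # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = _CODES.get(ch, "")      # vowels and Y map to ""
        if code and code != prev:
            out.append(code)
            if len(out) == 4:
                break
        prev = code                    # a vowel resets prev, allowing repeats
    return "".join(out).ljust(4, "0")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT NOT NULL, sx TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_names_sx ON names (sx)")  # lookups use this index
sample = ["Franky", "Shawn", "Shaun", "Robert", "Rupert", "Maria"]
conn.executemany("INSERT INTO names VALUES (?, ?)",
                 [(n, soundex(n)) for n in sample])


def suggest(conn, query):
    """Hypothetical helper: names whose stored Soundex equals the query's."""
    rows = conn.execute("SELECT name FROM names WHERE sx = ? ORDER BY name",
                        (soundex(query),))
    return [r[0] for r in rows]


print(suggest(conn, "Sean"))  # ['Shaun', 'Shawn']
```

Note that Soundex keeps the first letter, so Phranky (P652) and Franky (F652) get different codes; a lookup like this catches Sean/Shawn but would not, on its own, match Phranky to Franky.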
