Super fuzzy name checking?


Question

I'm working on an in-house CRM. The company's current frontend allows lots of duplicates. I'm trying to stop end users from entering the same person twice because they searched for 'Bill Johnson' and not 'William Johnson'. So the user will enter some information about their new customer, and we'll find similar names (including fuzzy matches), compare them against what is already in our database, and ask whether they meant one of those. Does such a database or technology exist?

Recommended answer

I implemented this kind of functionality on a website. I use double_metaphone() + levenshtein() in PHP. I precalculate a double_metaphone() for each entry in the database, which I look up using a SELECT on the first x chars of the 'metaphoned' search term.

Then I sort the returned results by their Levenshtein distance. double_metaphone() is not part of any PHP library (last time I checked), so I borrowed a PHP implementation I found on the net a long while ago (the site is no longer online). I should post it somewhere, I suppose.
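The sorting step can be sketched as follows. This is a minimal illustration in Python rather than PHP, and the helper name rank_by_distance is hypothetical, not something from the original code. It computes the classic edit distance (the same measure PHP's levenshtein() returns) and orders candidate names by how close they are to the search term.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def rank_by_distance(query, candidates):
    """Hypothetical helper: closest candidates first, case-insensitive."""
    q = query.lower()
    return sorted(candidates, key=lambda name: levenshtein(q, name.lower()))


print(rank_by_distance("Jonson", ["Jackson", "Johnson", "Johnston"]))
# → ['Johnson', 'Johnston', 'Jackson']
```

Sorting in application code like this is what you fall back to when the database cannot compute the distance itself.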

EDIT: The website is still on archive.org: http://web.archive.org/web/20080728063208/http://swoodbridge.com/DoubleMetaPhone/

Or in the Google cache: http://webcache.googleusercontent.com/search?q=cache:Tr9taWl9hMIJ:swoodbridge.com/DoubleMetaPhone/+Stephen+Woodbridge+double_metaphon

That leads to many other useful links with source code for double_metaphone(), including one in JavaScript on GitHub: http://github.com/maritz/js-double-metaphone

EDIT: I went through my old code, and here are roughly the steps I follow, in pseudocode to keep it clear:

1) Precompute a double_metaphone() for every word in the database, i.e., $word = 'blahblah'; $soundslike = double_metaphone($word);

2) At lookup time, $word is fuzzy-searched against the database: $soundslike = double_metaphone($word)

3) SELECT * FROM table WHERE soundslike LIKE $soundslike (if you have levenshtein stored as a procedure, much better: SELECT * FROM table WHERE levenshtein(soundslike, $soundslike) < mythreshold ORDER BY levenshtein(word, $word) ASC LIMIT ... etc.)

It has worked well for me, although I can't use a stored procedure, since I have no control over the server and it's running MySQL 4.20 or something.
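The steps above can be sketched end to end as follows. This is only an illustration under several assumptions, not the original PHP code: it uses Python with an in-memory SQLite database instead of MySQL, a small Soundex implementation as a stand-in for double_metaphone() (which is not in the standard library), and a hypothetical customers table with hypothetical helpers build_db and fuzzy_lookup. SQLite's create_function plays the role of the stored procedure.

```python
import sqlite3


def soundex(name: str) -> str:
    """Soundex key; a stand-in here for double_metaphone()."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    chars = [c for c in name.upper() if c.isalpha()]
    if not chars:
        return ""
    key, last = chars[0], codes.get(chars[0], "")
    for ch in chars[1:]:
        digit = codes.get(ch, "")
        if digit and digit != last:
            key += digit
        if ch not in "HW":  # H and W do not separate same-coded letters
            last = digit
    return (key + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def build_db(names):
    """Step 1: store each word together with its precomputed phonetic key."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL (the 'stored procedure' of step 3).
    conn.create_function("levenshtein", 2, levenshtein)
    conn.execute("CREATE TABLE customers (word TEXT, soundslike TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(n, soundex(n)) for n in names])
    return conn


def fuzzy_lookup(conn, word, threshold=2, limit=5):
    """Steps 2 and 3: key the search term, filter on key distance,
    then order by distance between the actual words."""
    key = soundex(word)
    rows = conn.execute(
        "SELECT word FROM customers "
        "WHERE levenshtein(soundslike, ?) < ? "
        # word ASC is only a tie-break so the output is deterministic
        "ORDER BY levenshtein(lower(word), lower(?)) ASC, word ASC LIMIT ?",
        (key, threshold, word, limit))
    return [row[0] for row in rows]


conn = build_db(["Johnson", "Jonson", "Johnsen", "Jackson", "Smith"])
print(fuzzy_lookup(conn, "Jonsen"))
# → ['Johnsen', 'Jonson', 'Johnson']
```

Calling the distance function in the WHERE clause scans every row, so on a large table it may be worth narrowing the candidates first with the LIKE prefix match on the stored key, as in the first form of step 3.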
