How to calculate a score for Metaphone/Soundex name searching in .NET

Question

I need to compute a score for the name field in a search that uses Soundex or Metaphone matching. For example, if I search for "JOHN DOE", I retrieve every record that sounds like this search parameter, which returns a large number of records with matching Soundex or Metaphone codes. I therefore need to assign a score to each returned record so that the closest matches are shown at the top of the list, and so that the user can, for example, keep only the records that match at 85% or 90%. Please suggest a technique for computing such a score in C# from the Soundex or Metaphone results.

Recommended answer

I'm assuming you search all your strings and keep only the ones that contain ALL the Soundex codes in the query string. For example, if the query is "John Doe", you would have two Soundex codes, one for John and the other for Doe. You would then retrieve all strings that contain at least these two Soundex codes.
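The filtering step assumed above could be sketched roughly as follows. This is a simplified American Soundex and not necessarily the variant your database or library uses; the names SoundexFilter, CodesOf, and Filter are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class SoundexFilter
{
    // Digit for each letter A..Z; '0' marks letters that carry no code
    // (the vowels plus H, W, and Y).
    private const string Codes = "01230120022455012623010202";

    // Simplified American Soundex: first letter plus three digits, zero-padded.
    public static string Soundex(string word)
    {
        char[] letters = word.ToUpperInvariant()
                             .Where(c => c >= 'A' && c <= 'Z')
                             .ToArray();
        if (letters.Length == 0) return string.Empty;

        var sb = new StringBuilder();
        sb.Append(letters[0]);
        char previous = Codes[letters[0] - 'A'];

        for (int i = 1; i < letters.Length && sb.Length < 4; i++)
        {
            char c = letters[i];
            char code = Codes[c - 'A'];
            if (code != '0' && code != previous) sb.Append(code);
            // H and W do not separate letters with the same code; vowels do.
            if (c != 'H' && c != 'W') previous = code;
        }
        return sb.ToString().PadRight(4, '0');
    }

    // Set of Soundex codes of all space-separated tokens in the text.
    public static HashSet<string> CodesOf(string text) =>
        new HashSet<string>(
            text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                .Select(Soundex)
                .Where(code => code.Length > 0));

    // Keeps the records that contain every Soundex code of the query.
    public static List<string> Filter(string query, IEnumerable<string> records)
    {
        HashSet<string> queryCodes = CodesOf(query);
        return records.Where(r => queryCodes.IsSubsetOf(CodesOf(r))).ToList();
    }
}
```

With this sketch, Filter("John Doe", ...) keeps "Jon Doe" and "John F. Doe" but also "Jane Doe", because John, Jon, and Jane all map to J500. That coarseness is why a separate ranking step is needed.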

Now, if you get too many records, you need to apply techniques from the domain of Information Retrieval to rank your results. Unfortunately, there are many ways to do it. I'll describe some of my favorites in increasing order of complexity:

  1. Use edit distance to rank your strings. You would have a function GetEditDistance(s1, s2) that returns the number of additions, updates, and deletions you need to make in s1 to get s2. This is fairly simple, and you can get code and more information from here: How to calculate distance similarity measure of given 2 strings?.
  2. Use a similarity metric such as Jaccard similarity. You take two strings and compute the ratio of the count of common characters to the count of all distinct characters. This is the character-level Jaccard score. You can also do it at the token level. For example, the token-level Jaccard score between "John Doe" and "John Wolfenstein" is 1/3, but for "John Doe" and "John F. Doe" the score is 2/3. Other similarity metrics are Dice and Cosine, which are also very easy to calculate and have dedicated Wikipedia pages.
  3. Finally, if you want to do it "properly", as in the IR book I linked above, you first need to calculate TF/IDF. This essentially assigns a weight to each term that occurs in your records. If a term occurs very often (like John), its weight is lower. If a term is rather rare (like Wolfenstein), its weight is higher. Once you have the weights, you use a similarity metric such as those described in #2.
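A minimal sketch of the first two options, assuming plain Levenshtein distance for GetEditDistance and case-insensitive, space-separated tokens for the Jaccard score. The class name NameSimilarity and the helpers EditSimilarity and TokenJaccard are hypothetical; only GetEditDistance is named in the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NameSimilarity
{
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn s1 into s2.
    public static int GetEditDistance(string s1, string s2)
    {
        var previous = new int[s2.Length + 1];
        var current = new int[s2.Length + 1];
        for (int j = 0; j <= s2.Length; j++) previous[j] = j;

        for (int i = 1; i <= s1.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= s2.Length; j++)
            {
                int substitution = previous[j - 1] + (s1[i - 1] == s2[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // Reuse the two row buffers for the next iteration.
            int[] swap = previous; previous = current; current = swap;
        }
        return previous[s2.Length];
    }

    // Assumed normalization to [0, 1], where 1 means identical strings.
    public static double EditSimilarity(string s1, string s2)
    {
        int longest = Math.Max(s1.Length, s2.Length);
        return longest == 0 ? 1.0 : 1.0 - (double)GetEditDistance(s1, s2) / longest;
    }

    // Token-level Jaccard: shared tokens divided by all distinct tokens.
    public static double TokenJaccard(string a, string b)
    {
        HashSet<string> setA = Tokens(a);
        HashSet<string> setB = Tokens(b);
        if (setA.Count == 0 && setB.Count == 0) return 1.0;
        int common = setA.Count(t => setB.Contains(t));
        int union = setA.Count + setB.Count - common;
        return (double)common / union;
    }

    private static HashSet<string> Tokens(string text) =>
        new HashSet<string>(
            text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);
}
```

Under these assumptions, TokenJaccard("John Doe", "John Wolfenstein") evaluates to 1/3 and TokenJaccard("John Doe", "John F. Doe") to 2/3, matching the figures in the list. Dividing the edit distance by the longer string's length is only one possible way to turn it into a percentage-style score.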

Update for the example in the OP's comment

In your example, the query is osama and the results are osama, ossama, ussama, oswin, ASAMOAH. It looks to me like the Dice coefficient or Cosine similarity would work best in your case. Calculating the Dice coefficient is very easy, so I'll use it here, but you might also want to experiment with Cosine similarity.

To calculate the character-level Dice coefficient, use the following formula:

Dice coefficient = 2 * (count of common characters between query and result) / (sum of all characters in query and result)

For example, the Dice coefficient between osama and ossama is 2*5/(5+6)=0.91.

Below are the Dice coefficients for all results of the query osama:

osama   osama   ->  1.00
osama   ossama  ->  0.91
osama   ussama  ->  0.73
osama   oswin   ->  0.40
osama   ASAMOAH ->  0.83

So the ranked results would be osama, ossama, ASAMOAH, ussama, oswin, which looks reasonable to me.
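The Dice ranking above could be sketched in C# along these lines. It assumes that common characters are counted case-insensitively and as a multiset (a repeated letter matches at most as often as it occurs in both strings), which is the reading that reproduces the scores listed above up to rounding. The names DiceRanking, Rank, and minScore are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DiceRanking
{
    // Character-level Dice coefficient:
    // 2 * (common characters) / (length of query + length of candidate).
    public static double Dice(string query, string candidate)
    {
        if (query.Length + candidate.Length == 0) return 1.0;

        // Count how often each character occurs in the query.
        var counts = new Dictionary<char, int>();
        foreach (char c in query.ToUpperInvariant())
        {
            counts.TryGetValue(c, out int seen);
            counts[c] = seen + 1;
        }

        // Each candidate character consumes one matching query character.
        int common = 0;
        foreach (char c in candidate.ToUpperInvariant())
        {
            if (counts.TryGetValue(c, out int left) && left > 0)
            {
                counts[c] = left - 1;
                common++;
            }
        }
        return 2.0 * common / (query.Length + candidate.Length);
    }

    // Scores every candidate, drops those below minScore, best match first.
    public static List<(string Name, double Score)> Rank(
        string query, IEnumerable<string> candidates, double minScore = 0.0) =>
        candidates
            .Select(name => (Name: name, Score: Dice(query, name)))
            .Where(r => r.Score >= minScore)
            .OrderByDescending(r => r.Score)
            .ToList();
}
```

Calling Rank("osama", results) orders the five names as shown above. Passing a threshold such as minScore: 0.8 keeps only osama, ossama, and ASAMOAH, which is one way to offer the percentage cut-off asked for in the question.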
