Determine if two names are close to each other


Question

I'm building a system for my school that lets us check, at parties and other events, whether a student is black-listed. Checking a student is easy, since I can just look the student up in my database and see if he/she is black-listed.

Here is where it gets difficult though.

At our parties, each student can invite one person. In theory, a student who is black-listed can be invited by another student and bypass the system. I cannot check the guest table for black-listed students, because only a name is provided when you invite your guest.

So I need to check whether a black-listed name is close to a guest name and display a warning if it is. Unfortunately, there are a few things to take into account.

Names can differ quite a bit. In Denmark, a standard name consists of three "names", like "Niels Faurskov Andersen". But a student may just type "Niels Faurskov" or "Niels Andersen", or even leave out some characters.

So a full name such as Niels Faurskov Andersen could be entered as

  • Niels Andersen
  • Niels Faurskov
  • Niels Faurskov Andersen
  • Nils Faurskov Andersen
  • Nils Andersen
  • niels faurskov
  • niels Faurskov

And so on...

Another thing is that the Danish alphabet contains "æøå" in addition to the usual a-z. That said, the whole site and database are UTF-8 encoded.

I've looked into various methods to check the difference between two strings, and the Levenshtein distance doesn't quite do it.

I found this thread on StackOverflow: Getting the closest string match

It seemed to provide the right data; however, I wasn't quite sure which method to choose.

I'm coding this part in PHP. Does anybody have an idea how to do this? Maybe with MySQL, or with a modified version of the Levenshtein distance? Could regex work?

Answer

Introduction

Right now your matching conditions may be too broad. However, you can use the Levenshtein distance to compare your words. It may not be easy to meet every goal with it, such as similarity in sound. So I suggest splitting your problem into several smaller ones.

For example, you can create a custom checker that takes a callable as input. The callable receives two strings and answers whether they are the same (for levenshtein that means a distance smaller than some value; for similar_text, some percentage of similarity, and so on. It's up to you to define the rules).
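As a rough sketch of such callables (the helper names levenshteinChecker() and similarTextChecker() and the thresholds are illustrative assumptions, not part of the code further down), each factory below returns a function that takes two strings and answers yes or no:

```php
<?php
// Hypothetical factories: each returns a callable that takes two strings
// and reports whether they count as "the same" under the chosen rule.

function levenshteinChecker(int $maxDistance): callable
{
    return function (string $a, string $b) use ($maxDistance): bool {
        // similar when at most $maxDistance single-character edits apart
        return levenshtein($a, $b) <= $maxDistance;
    };
}

function similarTextChecker(float $minPercent): callable
{
    return function (string $a, string $b) use ($minPercent): bool {
        // similar_text() writes the similarity percentage into $percent
        similar_text($a, $b, $percent);
        return $percent >= $minPercent;
    };
}

$isClose = levenshteinChecker(2);
var_dump($isClose('nils andersen', 'niels andersen')); // bool(true): one insertion apart
```

Which rule and threshold fit best depends on your data, so treat the values here as placeholders to tune.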


Similarity, based on words

All of the built-in functions will fail when you're looking for a partial match, especially a non-ordered one. So you'll need a more complex comparison tool. You have:

  • Data string (that will be in the DB, for example). It looks like D = D1 D2 ... Dn
  • Search string (that will be user input). It looks like S = S1 S2 ... Sm

Here the spaces stand for any whitespace (I assume whitespace does not affect similarity). Also, n ≥ m. With this definition, your problem is to find a set of m words in D that is similar to S. By set I mean any unordered sequence. Hence, if we find any such sequence in D, then S is similar to D.

Obviously, if n < m, the input contains more words than the data string. In that case you can either treat them as not similar or proceed as above with data and input swapped (that looks a little odd, but is applicable in some sense).


Implementation

To do this, you'll need to be able to create the set of strings made up of m words from D. Based on this question of mine, you can do that with:

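// Advances a bit string such as '011' to the next arrangement of the same
// bits ('101', then '110'); returns false when none is left.
protected function nextAssoc($assoc)
{
   // find the right-most '01' pair and swap it to '10'
   if(false !== ($pos = strrpos($assoc, '01')))
   {
      $assoc[$pos]   = '1';
      $assoc[$pos+1] = '0';
      // keep the prefix, then reset the tail to all zeros followed by all ones
      return substr($assoc, 0, $pos+2).
             str_repeat('0', substr_count(substr($assoc, $pos+2), '0')).
             str_repeat('1', substr_count(substr($assoc, $pos+2), '1'));
   }
   return false;
}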

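// Returns every selection of $count items from $data (a list with keys 0..n-1),
// or null when $data has fewer than $count items.
protected function getAssoc(array $data, $count=2)
{
   if(count($data)<$count)
   {
      return null;
   }
   // start with the last $count positions selected, e.g. '011'
   $assoc   = str_repeat('0', count($data)-$count).str_repeat('1', $count);
   $result = [];
   do
   {
      // array_filter() drops the '0' entries, so only keys marked '1' remain
      $result[]=array_intersect_key($data, array_filter(str_split($assoc)));
   }
   while($assoc=$this->nextAssoc($assoc));
   return $result;
}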

So for any array, getAssoc() returns an array of unordered selections of $count items each (m in the notation above).
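To see what such selections look like, here is a small stand-alone sketch. It uses a recursive formulation rather than the bit-string approach above, and combinations() is a hypothetical name chosen for this illustration, so the order of the results may differ from what getAssoc() produces:

```php
<?php
// Hypothetical stand-alone helper: every selection of $k items from $items,
// keeping the items in their original relative order.
function combinations(array $items, int $k): array
{
    if ($k === 0) {
        return [[]];            // exactly one way to pick nothing
    }
    if (count($items) < $k) {
        return [];              // not enough items left
    }
    $first  = array_shift($items);
    $result = [];
    // selections that include the first item
    foreach (combinations($items, $k - 1) as $rest) {
        $result[] = array_merge([$first], $rest);
    }
    // selections that skip the first item
    foreach (combinations($items, $k) as $rest) {
        $result[] = $rest;
    }
    return $result;
}

foreach (combinations(['niels', 'faurskov', 'andersen'], 2) as $pair) {
    echo implode(' ', $pair), PHP_EOL;
}
// niels faurskov
// niels andersen
// faurskov andersen
```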

The next step is the order within each selection. We should search for both Niels Andersen and Andersen Niels in our D string. Therefore, you'll need to be able to create the permutations of an array. It's a very common problem, but I'll put my version here too:

// Returns all orderings of $input as a list of re-indexed arrays.
protected function getPermutations(array $input)
{
   if(count($input)==1)
   {
      return [$input];
   }
   $result = [];
   foreach($input as $key=>$element)
   {
      // put $element first, then append every permutation of the remaining items
      foreach($this->getPermutations(array_diff_key($input, [$key=>0])) as $subarray)
      {
         $result[] = array_merge([$element], $subarray);
      }
   }
   return $result;
}

After this you'll be able to create selections of m words and then, by permuting each of them, get all the variants to compare with the search string S. Each comparison is done via a callback, such as levenshtein. Here's a sample:

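public function checkMatch($search, callable $checker=null, array $args=[], $return=false)
{
   // fall back to levenshtein() when no checker is passed
   $checker = $checker ?: 'levenshtein';
   // mb_strtolower() (mbstring extension) also lower-cases letters such as Æ, Ø and Å
   $data   = preg_split('/\s+/', mb_strtolower($this->data, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
   $search = trim(preg_replace('/\s+/', ' ', mb_strtolower($search, 'UTF-8')));
   // getAssoc() returns null when the search has more words than the data
   foreach(($this->getAssoc($data, substr_count($search, ' ')+1) ?: []) as $assoc)
   {
       foreach($this->getPermutations($assoc) as $ordered)
       {
           $ordered = join(' ', $ordered);
           $result  = call_user_func_array($checker, array_merge([$ordered, $search], $args));
           if($result<=$this->distance)
           {
               // first candidate within the allowed distance wins
               return $return?$ordered:true;
           }
       }
   }

   return $return?null:false;
}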

This checks similarity using the user callback, which must accept at least two parameters (the strings being compared). You can also have it return the string that triggered the positive result. Note that this code ignores the difference between upper and lower case; if you don't want that behavior, remove the lowercasing calls.
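One caveat for names containing "æøå": PHP's built-in levenshtein() compares bytes, so in UTF-8 a single multi-byte letter can count as more than one edit. Since the callback is pluggable, a character-based variant can be passed instead. The sketch below is one possible version; mbLevenshtein() is a hypothetical name, and it assumes the mbstring extension and PHP 7.4+ (for mb_str_split()):

```php
<?php
// Hypothetical character-based Levenshtein distance for UTF-8 strings.
function mbLevenshtein(string $a, string $b): int
{
    $a = mb_str_split($a, 1, 'UTF-8');
    $b = mb_str_split($b, 1, 'UTF-8');

    // $prev[$j] = distance between the processed prefix of $a and the
    // first $j characters of $b
    $prev = range(0, count($b));
    foreach ($a as $i => $charA) {
        $curr = [$i + 1];
        foreach ($b as $j => $charB) {
            $cost   = ($charA === $charB) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1,  // deletion
                $curr[$j] + 1,      // insertion
                $prev[$j] + $cost   // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[count($b)];
}

// With this file saved as UTF-8:
var_dump(mbLevenshtein('søren', 'soren')); // int(1): one character differs
var_dump(levenshtein('søren', 'soren'));   // int(2): "ø" is two bytes
```

It takes two strings like levenshtein() does, so it can be passed as the $checker argument in the same way.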

A sample of the full code is available in this listing (I didn't use a sandbox, since I'm not sure how long a code listing stays available there). With this usage sample:

$data   = 'Niels Faurskov Andersen';
$search = [
    'Niels Andersen',
    'Niels Faurskov',
    'Niels Faurskov Andersen',
    'Nils Faurskov Andersen',
    'Nils Andersen',
    'niels faurskov',
    'niels Faurskov',
    'niffddels Faurskovffre' // deliberately garbled input that should not match
];

$checker = new Similarity($data, 2);

echo(sprintf('Testing "%s"'.PHP_EOL.PHP_EOL, $data));
foreach($search as $name)
{
   echo(sprintf(
      'Name "%s" has %s'.PHP_EOL, 
      $name, 
      ($result=$checker->checkMatch($name, 'levenshtein', [], 1))
         ?sprintf('matched with "%s"', $result)
         :'mismatched'
      )
   );

}

you'll get result like:

Testing "Niels Faurskov Andersen"

Name "Niels Andersen" has matched with "niels andersen"
Name "Niels Faurskov" has matched with "niels faurskov"
Name "Niels Faurskov Andersen" has matched with "niels faurskov andersen"
Name "Nils Faurskov Andersen" has matched with "niels faurskov andersen"
Name "Nils Andersen" has matched with "niels andersen"
Name "niels faurskov" has matched with "niels faurskov"
Name "niels Faurskov" has matched with "niels faurskov"
Name "niffddels Faurskovffre" has mismatched

A demo of this code is also available, just in case.


Complexity

Since you care not only about having a method but also about how good it is, you may notice that this code performs quite a lot of operations, at the very least in generating the string parts. The complexity here has two parts:

  • String parts generation. If you want to generate all string parts, you'll have to do it as I've described above. A possible point of improvement is the generation of the unordered string sets (which comes before permutation). But I doubt it can be improved, because the method in the code above doesn't generate them by brute force; it computes them mathematically (their cardinality is C(n, m) = n! / (m! (n - m)!)).
  • Similarity checking. Here the complexity depends on the chosen similarity checker. For example, similar_text() has O(N^3) complexity, so with large comparison sets it will be extremely slow.
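To get a feel for how many candidate strings the checker may be called on, note that each of the C(n, m) selections is permuted in m! ways, which gives n! / (n - m)! ordered candidates in the worst case. A small sketch (candidateCount() is a hypothetical helper, not part of the class):

```php
<?php
// Hypothetical helper: number of ordered m-word selections from n words,
// i.e. C(n, m) * m! = n * (n - 1) * ... * (n - m + 1).
function candidateCount(int $n, int $m): int
{
    if ($m > $n) {
        return 0; // the search has more words than the data string
    }
    $count = 1;
    for ($i = 0; $i < $m; $i++) {
        $count *= $n - $i;
    }
    return $count;
}

foreach ([[3, 2], [4, 2], [6, 3]] as [$n, $m]) {
    printf("n=%d, m=%d: %d candidates\n", $n, $m, candidateCount($n, $m));
}
// n=3, m=2: 6 candidates
// n=4, m=2: 12 candidates
// n=6, m=3: 120 candidates
```

For typical three-word names this stays tiny, so the cost of the similarity callback is likely to dominate.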

But you can still improve the current solution by checking on the fly. Right now this code first generates all string sub-sequences and then checks them one by one. In the common case you don't need to do that, so you may want to check each sequence immediately after generating it. That improves performance for strings that have a match (but not for those that have none).
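One way to check on the fly is a PHP generator, which builds each candidate only when the loop asks for it. The following is a minimal stand-alone sketch under that assumption; orderedSelections() and firstMatch() are hypothetical names, and levenshtein() with a fixed maximum distance stands in for the pluggable checker:

```php
<?php
// Hypothetical generator: lazily yields every ordered selection of $k
// items from $items, one at a time.
function orderedSelections(array $items, int $k): Generator
{
    if ($k === 0) {
        yield [];
        return;
    }
    foreach ($items as $key => $item) {
        $rest = $items;
        unset($rest[$key]);     // each word is used at most once
        foreach (orderedSelections($rest, $k - 1) as $tail) {
            yield array_merge([$item], $tail);
        }
    }
}

// Returns the first candidate within $maxDistance of $search, or null.
function firstMatch(string $data, string $search, int $maxDistance): ?string
{
    $words  = preg_split('/\s+/', mb_strtolower($data, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    $search = trim(preg_replace('/\s+/', ' ', mb_strtolower($search, 'UTF-8')));
    $count  = substr_count($search, ' ') + 1;

    foreach (orderedSelections($words, $count) as $candidate) {
        $candidate = implode(' ', $candidate);
        if (levenshtein($candidate, $search) <= $maxDistance) {
            return $candidate;  // stop here; later candidates are never built
        }
    }
    return null;
}

var_dump(firstMatch('Niels Faurskov Andersen', 'Nils Andersen', 2));          // string(14) "niels andersen"
var_dump(firstMatch('Niels Faurskov Andersen', 'niffddels Faurskovffre', 2)); // NULL
```

A search with no match still walks through every candidate, so the gain applies only to the positive case.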
