优化近乎重复的价值搜索 [英] Optimizing near-duplicate value search

查看:77
本文介绍了优化近乎重复的价值搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试在一组字段中查找几乎重复的值,以允许管理员清理它们.

I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up.

我要匹配两个条件

  1. 一个字符串完全包含在另一个字符串中,并且至少是其长度的1/4
  2. 这些字符串的编辑距离小于两个字符串总长度的5%

伪PHP代码:

foreach($values as $value){
$matches = array();
foreach($values as $match){
  if(
    (
      $value['length'] < $match['length']
      &&
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      $match['length'] < $value['length']
      &&
      $match['length'] * 4 > $value['length']
      &&
      stripos($value['value'], $match['value']) !== false
    )
    ||
    (
      abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
      &&
      $match['changes'] * 20 <= ($value['length'] + $match['length'])
      )
    ){
      $matches[] = &$match;
    }
}
// output matches for current outer loop value
}

我试图尽可能减少对相对昂贵的striposlevenshtein函数的调用,这大大减少了执行时间. 但是,作为O(n ^ 2)运算,它并不能扩展到更大的值集,并且似乎简单地遍历数组要花费大量的处理时间.

I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit. However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays.

正在操作的几组值的某些属性

Some properties of a few sets of values being operated on


Total   | Strings      | # of matches per string |          |
Strings | With Matches | Average | Median |  Max | Time (s) |
--------+--------------+---------+--------+------+----------+
    844 |          413 |     1.8 |      1 |   58 |    140   |
    593 |          156 |     1.2 |      1 |    5 |     62   | 
    272 |          168 |     3.2 |      2 |   26 |     10   |
    157 |           47 |     1.5 |      1 |    4 |      3.2 |
    106 |           48 |     1.8 |      1 |    8 |      1.3 |
     62 |           47 |     2.9 |      2 |   16 |      0.4 |

我还有什么其他方法可以减少检查标准的时间,更重要的是,我有什么方法可以减少所需的标准检查次数(例如,通过预处理输入值),因为选择性这么低吗?

Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?

已实施的解决方案

// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
for($vid = 0; $vid < $values_count; $vid++){
for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
  if(
    (
      $value['length'] * 4 > $match['length']
      &&
      stripos($match['value'], $value['value']) !== false
    )
    ||
    (
      ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
      &&
      0 < ($changes = levenshtein($value['value'], $match['value']))
      &&
      $changes * 20 <= ($value['length'] + $match['length'])
      )
    ){
      // store match in both directions
      $matches[$vid][$mid] = true;
      $matches[$mid][$vid] = true;
    }

}
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
  // sort inner array of matches by usage count with uksort()
  // output matches
}

推荐答案

您可以先按长度对字符串进行排序(O(N)),然后仅将较小的字符串检查为子字符串或较大的字符串,并且仅使用levenshtein进行检查差异不大的字符串对.

You could first order the strings by length ( O(N) ) and then only check smaller strings to be substrings or larger strings, plus only check with levenshtein in string pairs for which the difference is not too large.

您已经执行了这些检查,但是现在您对所有N x N对都进行了检查,而按长度先进行选择将有助于您减少要首先检查的对.即使只包含将失败的测试,也应避免使用N x N循环.

You already perform these checks, but now you do it for all N x N pairs, while preselecting first by length will help you reduce the pairs to check first. Avoid the N x N loop, even if it contains only tests that will fail.

对于子字符串匹配,您可以通过为所有较小项创建索引来进一步改进,并在解析较大项时相应地进行更新.索引应该可以形成分支于字母上的树状结构,其中每个单词(字符串)形成从根到叶的路径.这样,您可以发现索引中的任何单词是否与某个字符串相匹配.对于匹配字符串中的每个字符,请尝试继续使用树索引中的所有指针,并在索引处创建一个新的指针.如果指针不能前进到索引中的下一个字符,则将其删除.如果有任何指针到达叶子的音符,则说明已找到子字符串匹配项. 我认为实现这一点并不困难,但也不是简单的.

For substring matching you could further improve by creating an index for all smaller items, and update this accordingly as you parse larger items. The index should can form a tree structure branching on letters, where each word (string) forms a path from root to leaf. This way you can find if any of the words in the index compare to some string to match. For each character in your match string try to proceed any pointers in the tree index, and create a new pointer at the index. If a pointer can not be proceeded to a following character in the index, you remove it. If any pointer reaches a leaf note, you've found a substring match. Implementing this is, I think, not difficult, but not trivial either.

这篇关于优化近乎重复的价值搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆