Near duplicate detection in Solr

Question

Solr is being used to search through a database of user-generated listings. These listings are imported into Solr from MySQL via the DataImportHandler.

Problem: Quite often, users report the same listing to the database, sometimes with minor changes to their listing post to avoid being easily detected as a duplicate post.

How should I implement near-duplicate detection with Solr? I do not mind having near-duplicate listings in the Solr index as long as the search results do not contain them.

I guess there are 4 possible places to do this near-duplicate detection:


  1. When the user submits the listing (PHP is being used here)
  2. During the data import from MySQL to Solr
  3. After the data import from MySQL
  4. When a search is being done

What is the recommended way to do this? Thank you!

Recommended answer

I'm not familiar with Solr, but I would implement the near-duplicate detection when the user submits the listing. There are quite different algorithms for detecting near-duplicates, such as the Jaccard index.
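For reference, the Jaccard index of two sets A and B is usually defined as the size of their intersection divided by the size of their union. Treating each listing as a set of words is an assumption made here for illustration:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

A value of 1 means the two word sets are identical, and 0 means they share no words.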

I made a little script to see the difference between the similarity coefficients:

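<?php

$input1 = "Hello there, this is a test 1, you see it's almost the same";
$input2 = "Hello there, this is a test 2, you saw it, it's almost the same";
$input3 = "this is very different from the others, but who knows ?";

echo jackard($input1, $input1) . "<br />"; // results 1
echo jackard($input1, $input2) . "<br />"; // results 0.81481481481481
echo jackard($input1, $input3) . "<br />"; // results 0.25
echo jackard($input2, $input3); // results 0.24

// Word-overlap similarity between two strings, from 0 (no shared words) to 1.
// Note: despite the name, this computes 2 * matches / (count(a) + count(b)),
// which is the Sorensen-Dice form, not the Jaccard index (intersection / union).
function jackard($a, $b) {
    // Split on single spaces; punctuation stays attached to the words.
    $a_arr = explode(" ", $a);
    $b_arr = explode(" ", $b);
    // Words of $a that also occur in $b.
    $intersect_a_b = array_intersect($a_arr, $b_arr);
    return (count($intersect_a_b) / (count($a_arr) + count($b_arr))) * 2;
}
?>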

As you can see, a result of 1 means that it is the same sentence, or that it uses the same words in a different order. The smaller the value, the more unique the "sentence" is. This is a rather simple implementation. You could set a limit value, for example 0.4, put the "request" in a queue if its similarity to an existing listing exceeds this limit, and then look at the listing manually. This is not "efficient", but it gives you the idea, and it's up to you to develop a more complex and automated system/algorithm. And maybe you should also take a look here.
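The threshold idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names `tokenize`, `jaccard_index` and `needs_manual_review` are hypothetical, the comparison uses the set-based Jaccard index (intersection over union) rather than the coefficient from the script above, and the 0.4 default is just the example value from the answer, so it would need tuning against real listings.

```php
<?php
// Hypothetical helper: lowercase the text, replace punctuation (except
// apostrophes) with spaces, split on whitespace, and drop duplicate words.
function tokenize(string $text): array {
    $clean = preg_replace("/[^\\p{L}\\p{N}\\s']+/u", ' ', strtolower($text));
    $words = preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

// Set-based Jaccard index: |A intersect B| / |A union B|.
function jaccard_index(string $a, string $b): float {
    $setA = tokenize($a);
    $setB = tokenize($b);
    $union = count(array_unique(array_merge($setA, $setB)));
    if ($union === 0) {
        return 1.0; // assumption: two empty texts count as identical
    }
    return count(array_intersect($setA, $setB)) / $union;
}

// Returns true if the new listing is at least $threshold similar to any
// existing listing, i.e. it should be queued for a manual look.
function needs_manual_review(string $newListing, array $existingListings, float $threshold = 0.4): bool {
    foreach ($existingListings as $existing) {
        if (jaccard_index($newListing, $existing) >= $threshold) {
            return true;
        }
    }
    return false;
}

$existing = ["Hello there, this is a test 1, you see it's almost the same"];
$new = "Hello there, this is a test 2, you saw it, it's almost the same";
var_dump(needs_manual_review($new, $existing));
```

Comparing every new listing against every stored one is linear in the number of listings, so for a large database the candidates would probably need to be narrowed first, for example to listings by the same user or in the same category.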
