Near duplicate detection in Solr
Problem description
Solr is being used to search through a database of user-generated listings. These listings are imported into Solr from MySQL via the DataImportHandler.
Problem: Quite often, users submit the same listing to the database more than once, sometimes with minor changes to the post to avoid being easily detected as a duplicate.
How should I implement near-duplicate detection with Solr? I do not mind having near-duplicate listings in the Solr index as long as the search results do not contain them.
I guess there are 4 possible places to do this near-duplicate detection:
- When the user submits the listing (PHP is being used here)
- During the data import from MySQL to Solr
- After the data import from MySQL
- When a search is being done
What is the recommended way to do this? Thank you!
Recommended answer
I'm not familiar with Solr, so I would implement the near-duplicate detection when the user submits the listing. There are quite a few different algorithms for detecting near-duplicates, such as the Jaccard index.
I made a little script to see the difference between the similarity coefficients:
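<?php
$input1 = "Hello there, this is a test 1, you see it's almost the same";
$input2 = "Hello there, this is a test 2, you saw it, it's almost the same";
$input3 = "this is very different from the others, but who knows ?";

echo jackard($input1, $input1) . "<br />"; // results 1
echo jackard($input1, $input2) . "<br />"; // results 0.81481481481481
echo jackard($input1, $input3) . "<br />"; // results 0.25
echo jackard($input2, $input3); // results 0.24

// Word-overlap similarity of two strings: 0 means no shared words, 1 means the same words.
// Note: 2 * |shared| / (|a| + |b|) is the Sorensen-Dice coefficient, a close
// relative of the Jaccard index |A ∩ B| / |A ∪ B|.
function jackard($a, $b) {
    // Split on single spaces; punctuation stays attached to the words.
    $a_arr = explode(" ", $a);
    $b_arr = explode(" ", $b);
    // Words of $a that also occur in $b.
    $intersect_a_b = array_intersect($a_arr, $b_arr);
    return (count($intersect_a_b) / (count($a_arr) + count($b_arr))) * 2;
}
?>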
As you can see, a result of 1 means that the two inputs are the same sentence OR use the same words in a different order. The smaller the value, the more unique the "sentence" is.
This is a rather simple implementation. You could set a limit, for example 0.4, and put the "request" in a queue whenever its similarity to an existing listing exceeds that limit, then look at the listing manually. This is not "efficient", but it gives you the idea, and it's up to you to develop a more complex and automated system/algorithm. And maybe you should also take a look here.
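The script above actually computes 2 * |shared| / (|a| + |b|) on raw space-separated tokens, which is the Sorensen-Dice coefficient. As a minimal sketch of the set-based Jaccard index together with the threshold check described above, something like the following could work. The function names `word_set`, `jaccard_index` and `needs_review` are hypothetical, the 0.4 default is just the example limit from the text, and the tokenizer assumes plain English/ASCII listings.

```php
<?php
// Hypothetical helpers for illustration; not part of Solr or of the original script.

/** Split text into a set of lowercase words, dropping punctuation (ASCII only). */
function word_set(string $text): array
{
    $words = preg_split('/[^a-z0-9\']+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($words));
}

/** Jaccard index |A ∩ B| / |A ∪ B| over the word sets of two strings. */
function jaccard_index(string $a, string $b): float
{
    $setA = word_set($a);
    $setB = word_set($b);
    $intersection = count(array_intersect($setA, $setB));
    $union = count($setA) + count($setB) - $intersection;
    // Two empty texts are treated as identical.
    return $union === 0 ? 1.0 : $intersection / $union;
}

/** True if the new listing is at least $threshold similar to any existing listing. */
function needs_review(string $new, array $existing, float $threshold = 0.4): bool
{
    foreach ($existing as $listing) {
        if (jaccard_index($new, $listing) >= $threshold) {
            return true;
        }
    }
    return false;
}
```

With the three inputs from the script above, `jaccard_index($input1, $input2)` gives 0.6875 (11 shared words out of 16 distinct ones), so `needs_review($input2, [$input1])` would flag the second listing, while `$input3` stays below the limit. Comparing a new listing against every existing one is linear in the number of listings, so this is only a starting point for small data sets.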