如何找到相似的结果并按相似性排序? [英] How to find similar results and sort by similarity?
问题描述
如何查询按相似性排序的记录?
例如.搜索库存溢出"将返回
Eg. searching for "Stock Overflow" would return
- 堆栈溢出
- SharePoint溢出
- 数学溢出
- 政治溢出
- VFX溢出
例如.搜索"LO"将返回:
Eg. searching for "LO" would return:
- pabLO毕加索
- michelangeLO
- jackson polLOck
我需要什么帮助:
-
使用搜索引擎为&搜索MySQL表以获得更好的结果
Using a search engine to index & search a MySQL table, for better results
-
使用 Sphinx 搜索引擎以及PHP
Using the Sphinx search engine, with PHP
在PHP中使用 Lucene 引擎
使用全文索引,查找相似/包含的字符串
Using full-text indexing, to find similar/containing strings
什么不好用
- Levenshtein距离非常不稳定. ( UDF ,查询)
搜索狗"给我:
What does not work well
- Levenshtein distance is very erratic. (UDF, Query)
Searching for "dog" gives me:- 狗
- 沼泽
- ago
- 大
- 回声
-
LIKE
返回更好的结果,但是长查询没有返回任何结果,尽管确实存在相似的字符串 LIKE
returns better results, but returns nothing for long queries although similar strings do exist- 狗
- 狗狗
- 多加拉尔
- 教条
推荐答案
我发现,当您针对一个完整的字符串搜索另一个完整的字符串时,但是当您在一个字符串中查找关键字时,Levenshtein距离可能很好,此方法不会返回(有时)所需的结果.此外,SOUNDEX函数不适用于英语以外的其他语言,因此非常有限.您可以通过LIKE摆脱困境,但这实际上是针对基本搜索的.您可能想研究其他搜索方法以获取想要的结果.例如:
I have found out that the Levenshtein distance may be good when you are searching a full string against another full string, but when you are looking for keywords within a string, this method does not return (sometimes) the wanted results. Moreover, the SOUNDEX function is not suitable for languages other than english, so it is quite limited. You could get away with LIKE, but it's really for basic searches. You may want to look into other search methods for what you want to achieve. For example:
您可以使用 Lucene 作为项目的搜索基础.它以大多数主要的编程语言实现,并且非常快速且通用.该方法可能是最好的,因为它不仅搜索子字符串,而且搜索字母转置,前缀和后缀(全部组合).但是,您需要保留一个单独的索引(尽管有时可以使用CRON从独立脚本中对其进行更新).
You may use Lucene as search base for your projects. It's implemented in most major programming languages and it'd quite fast and versatile. This method is probably the best, as it not only search for substrings, but also letter transposition, prefixes and suffixes (all combined). However, you need to keep a separate index (using CRON to update it from a independent script once in a while works though).
或者,如果您想要MySQL解决方案,则全文功能非常好,并且肯定比存储过程快.如果您的表不是MyISAM,则可以创建一个临时表,然后执行全文搜索:
Or, if you want a MySQL solution, the fulltext functionality is pretty good, and certainly faster than a stored procedure. If your tables are not MyISAM, you can create a temporary table, then perform your fulltext search :
CREATE TABLE IF NOT EXISTS `tests`.`data_table` ( `id` int(10) unsigned NOT NULL AUTO_INCREMENT, `title` varchar(2000) CHARACTER SET latin1 NOT NULL, `description` text CHARACTER SET latin1 NOT NULL, PRIMARY KEY (`id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
如果您不想创建数据,请使用数据生成器生成一些随机数据你自己...
Use a data generator to generate some random data if you don't want to bother creating it yourself...
** 注意 **:列类型应为
latin1_bin
,以执行区分大小写的搜索,而不是对latin1
不区分大小写的搜索.对于unicode字符串,我建议utf8_bin
用于区分大小写,utf8_general_ci
用于不区分大小写的搜索.** NOTE ** : the column type should be
latin1_bin
to perform a case sensitive search instead of case insensitive withlatin1
. For unicode strings, I would recommendutf8_bin
for case sensitive andutf8_general_ci
for case insensitive searches.DROP TABLE IF EXISTS `tests`.`data_table_temp`; CREATE TEMPORARY TABLE `tests`.`data_table_temp` SELECT * FROM `tests`.`data_table`; ALTER TABLE `tests`.`data_table_temp` ENGINE = MYISAM; ALTER TABLE `tests`.`data_table_temp` ADD FULLTEXT `FTK_title_description` ( `title` , `description` ); SELECT *, MATCH (`title`,`description`) AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) as `score` FROM `tests`.`data_table_temp` WHERE MATCH (`title`,`description`) AGAINST ('+so* +nullam lorem' IN BOOLEAN MODE) ORDER BY `score` DESC; DROP TABLE `tests`.`data_table_temp`;
Read more about it from the MySQL API reference page
缺点是它不会寻找字母转置或相似的,听起来像"的单词.
The downside to this is that it will not look for letter transposition or "similar, sounds like" words.
** 更新 **
使用Lucene进行搜索,您只需要创建一个cron作业(所有Web主机都具有此功能"),该作业将仅执行一个PHP脚本(例如,"cd/path/to/script; php searchindexer") .php"),将更新索引.原因是索引成千上万的文档"(行,数据等)可能需要几秒钟,甚至是几分钟,但这是为了确保所有搜索都尽可能快地执行.因此,您可能需要创建要由服务器运行的延迟作业.可能是一夜之间,也可能是下一小时,这取决于您. PHP脚本应如下所示:
Using Lucene for your search, you will simply need to create a cron job (all web hosts have this "feature") where this job will simply execute a PHP script (i.g. "cd /path/to/script; php searchindexer.php") that will update the indexes. The reason being that indexing thousands of "documents" (rows, data, etc.) may take several seconds, even minutes, but this is to ensure that all searches are performed as fast as possible. Therefore, you may want to create a delay job to be run by the server. It may be overnight, or in the next hour, this is up to you. The PHP script should look something like this:
$indexer = Zend_Search_Lucene::create('/path/to/lucene/data'); Zend_Search_Lucene_Analysis_Analyzer::setDefault( // change this option for your need new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive() ); $rowSet = getDataRowSet(); // perform your SQL query to fetch whatever you need to index foreach ($rowSet as $row) { $doc = new Zend_Search_Lucene_Document(); $doc->addField(Zend_Search_Lucene_Field::text('field1', $row->field1, 'utf-8')) ->addField(Zend_Search_Lucene_Field::text('field2', $row->field2, 'utf-8')) ->addField(Zend_Search_Lucene_Field::unIndexed('someValue', $someVariable)) ->addField(Zend_Search_Lucene_Field::unIndexed('someObj', serialize($obj), 'utf-8')) ; $indexer->addDocument($doc); } // ... you can get as many $rowSet as you want and create as many documents // as you wish... each document doesn't necessarily need the same fields... // Lucene is pretty flexible on this $indexer->optimize(); // do this every time you add more data to you indexer... $indexer->commit(); // finalize the process
然后,这基本上就是您进行搜索(基本搜索)的方式:
Then, this is basically how you search (basic search) :
$index = Zend_Search_Lucene::open('/path/to/lucene/data'); // same search options Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8Num_CaseInsensitive() ); Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8'); $query = 'php +field1:foo'; // search for the word 'php' in any field, // +search for 'foo' in field 'field1' $hits = $index->find($query); $numHits = count($hits); foreach ($hits as $hit) { $score = $hit->score; // the hit weight $field1 = $hit->field1; // etc. }
总结:每种搜索方法各有利弊:
In conclusion each search methods have their own pros and cons :
- 您提到 Sphinx搜索,它看起来非常好,只要您可以在您的网站上运行恶魔网络主机.
- Zend Lucene需要执行cron作业才能重新索引数据库.尽管对用户来说是完全透明的,但这意味着任何新数据(或已删除的数据!)并不总是与数据库中的数据同步,因此不会在用户搜索时立即显示.
- MySQL FULLTEXT搜索既好又快速,但不能提供前两个功能和灵活性.
- You mentioned Sphinx search and it looks very good, as long as you can make the deamon run on your web host.
- Zend Lucene requires a cron job to re-index the database. While it is quite transparent to the user, this means that any new data (or deleted data!) is not always in sync with the data in your database and therefore won't show up right away on user search.
- MySQL FULLTEXT search is good and fast, but will not give you all the power and flexibility of the first two.
如果我忘记/遗漏了任何东西,请随时发表评论.
Please feel free to comment if I have forgotten/missed anything.
这篇关于如何找到相似的结果并按相似性排序?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
- Levenshtein distance is very erratic. (UDF, Query)