Levenshtein替代品 [英] levenshtein alternative

查看:118
本文介绍了Levenshtein替代品的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有很多查询,并使用levenshtein计算错别字,现在levenshtein导致mysql占用完整的cpu时间. 我的查询是UNION语句中的全文搜索+ levenshtein. sql1是我当前的查询,sql2只是全文搜索,这是快速的,并且不会占用太多的cpu时间,最后一个是leventhein,它将达到峰值!

i have a big set of queries and use levenshtein to calculate typos, now levenshtein causes mysql to take full cpu time. My query is a fulltext search + levenshtein in a UNION statement. sql1 is my current query, sql2 is only fulltext search which is fast and doesnt use too much cpu time, the last one the leventhein one which will peak!

你们中有人有错别字吗? 我已经想到了,请不要回答规范化数据,但是不适用于我的数据,因为我无法预先进行匹配/计算并创建带有索引的单独表.

Any of you have an alternative way to get typos as well? Please don't answer normalize data, I have thought of that, but is not applicable to my data, as I cannot pre-make the matches/calculations and create a separate table with indexes.

            $sql1 = "(SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE)) UNION (SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3) ORDER BY sanction_id";

        $sql2 = "SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE) ORDER BY sanction_id";

        $sql3 = "SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3";

推荐答案

如果仅与MySQL绑定,那么没有简单的解决方案.

If you are tied only to MySQL there is not an easy solution.

通常使用专门的ngram索引进行快速候选者查找过滤,然后仅对10-50个候选者计算levensthein即可解决,这比对所有对计算levensthein的速度更快.

Usually this is solved using specialized ngram indexing for fast candidate lookup filtering and then calculating levensthein only on like 10-50 candidates which is faster that calculating levensthein for all pairs.

内置了诸如Solr/Lucene之类的专业全文搜索引擎.

Specialized fulltext search engines like Solr/Lucene have this built in.

PostgreSQL具有pg_trgm contrib模块(http://www.postgresql.org/docs/9.0/static/pgtrgm.html),它的工作原理很吸引人.

PostgreSQL has pg_trgm contrib module (http://www.postgresql.org/docs/9.0/static/pgtrgm.html) which works like a charm.

您甚至可以使用全文索引在MySQL中进行模拟,但是您必须从所有文档中收集单词,将它们转换为ngram,在它们上创建全文索引,然后将它们全部砍在一起以进行快速查找.冗余带来了各种各样的麻烦,同步...不值得您花费时间.

You can even simulate this in MySQL using fulltext indexing, but you have to collect words from all your documents convert them to ngrams, create fulltext indexes on them, and hack them all together for fast lookup. Which brings all sorts of trouble with redundancy, sync...not worth your time.

这篇关于Levenshtein替代品的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆