MySQL Mixing Damerau–Levenshtein Fuzzy with Like Wildcard

Problem description

I recently implemented the Damerau–Levenshtein algorithm as a UDF in MySQL, and was wondering whether there is a way to combine the fuzzy matching of the Damerau–Levenshtein algorithm with the wildcard searching of the LIKE operator. Suppose I have the following data in a table:

ID | Text
---------------------------------------------
1  | let's find this document
2  | let's find this docment
3  | When the book is closed
4  | The dcument is locked

I want to run a query that incorporates the Damerau–Levenshtein algorithm...

SELECT text FROM tablename WHERE damlev('Document', tablename.text) <= 5;

...with a wildcard match, so that the query returns IDs 1, 2, and 4. I'm not sure of the syntax, whether this is even possible, or whether I need a different approach. The SELECT statement above works fine in isolation, but it compares the search term against the entire Text value, so it does not match individual words. I would have to change the SQL to...

SELECT text FROM tablename WHERE
 damlev('let''s find this document', tablename.text) <= 5;

...which of course returns only IDs 1 and 2. I'm hoping there is a way to combine the fuzzy and wildcard matching, so that I get back all records that have the word "document", or variations of it, appearing anywhere within the Text field.

Recommended answer

When working with person names and doing fuzzy lookups on them, what worked for me was to create a second table of words, plus a third table that serves as an intersect table for the many-to-many relationship between the table containing the text and the word table. When a row is added to the text table, you split the text into words and populate the intersect table accordingly, adding new words to the word table as needed. Once this structure is in place, lookups are faster, because you only need to run the damlev function over the table of unique words. A simple join then gets you the text containing the matching words.
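The structure described above can be sketched as follows. This is a minimal illustration rather than the answerer's actual code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the intersect table name `TextWord`, the column `Body`, the helper `add_text`, and the tokenizer rule are all assumptions made for the example.

```python
import re
import sqlite3

# Hypothetical schema: a text table, a table of unique words, and an
# intersect table (named TextWord here) linking the two.
SCHEMA = """
CREATE TABLE Text (TextId INTEGER PRIMARY KEY, Body TEXT NOT NULL);
CREATE TABLE Word (WordId INTEGER PRIMARY KEY, Word TEXT NOT NULL UNIQUE);
CREATE TABLE TextWord (
    TextId INTEGER NOT NULL REFERENCES Text(TextId),
    WordId INTEGER NOT NULL REFERENCES Word(WordId),
    PRIMARY KEY (TextId, WordId)
);
"""


def add_text(conn, body):
    """Insert one text row, then link it to each distinct word it contains."""
    text_id = conn.execute(
        "INSERT INTO Text (Body) VALUES (?)", (body,)
    ).lastrowid
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    for word in set(re.findall(r"[a-z0-9']+", body.lower())):
        # Add the word only if it is not already in the word table.
        conn.execute("INSERT OR IGNORE INTO Word (Word) VALUES (?)", (word,))
        (word_id,) = conn.execute(
            "SELECT WordId FROM Word WHERE Word = ?", (word,)
        ).fetchone()
        conn.execute(
            "INSERT INTO TextWord (TextId, WordId) VALUES (?, ?)",
            (text_id, word_id),
        )
    return text_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
for body in [
    "let's find this document",
    "let's find this docment",
    "When the book is closed",
    "The dcument is locked",
]:
    add_text(conn, body)

word_count = conn.execute("SELECT COUNT(*) FROM Word").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM TextWord").fetchone()[0]
print(word_count, "unique words,", link_count, "text-word links")
```

In MySQL the same "insert if new" step would typically use `INSERT IGNORE` against a unique index on the word column.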

A query for a single-word match would look something like this:

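-- Intersect is a reserved word as of MySQL 8.0.31, so it is quoted with backticks.
-- DISTINCT avoids duplicate rows when a text contains several matching words.
SELECT DISTINCT T.* FROM Word AS W
JOIN `Intersect` AS I ON I.WordId = W.WordId
JOIN Text AS T ON T.TextId = I.TextId
WHERE damlev('document', W.Word) <= 5;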
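To try the word-level lookup without a MySQL UDF, the sketch below registers a pure-Python distance function under the name `damlev` in an in-memory SQLite database. Several things here are assumptions: the function computes the optimal string alignment variant of Damerau–Levenshtein and lowercases its inputs (the source does not say which variant or case handling the UDF uses), the intersect table is named `TextWord`, only a few of the words are loaded, and the threshold is tightened to 2.

```python
import sqlite3


def damlev(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    a, b = a.lower(), b.lower()  # assumed: case-insensitive comparison
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


conn = sqlite3.connect(":memory:")
conn.create_function("damlev", 2, damlev)  # stand-in for the MySQL UDF
conn.executescript("""
CREATE TABLE Text (TextId INTEGER PRIMARY KEY, Body TEXT);
CREATE TABLE Word (WordId INTEGER PRIMARY KEY, Word TEXT UNIQUE);
CREATE TABLE TextWord (TextId INTEGER, WordId INTEGER);
INSERT INTO Text VALUES
    (1, 'let''s find this document'), (2, 'let''s find this docment'),
    (3, 'When the book is closed'),   (4, 'The dcument is locked');
-- Abbreviated word list: only a few words per text are loaded.
INSERT INTO Word VALUES
    (1, 'document'), (2, 'docment'), (3, 'dcument'), (4, 'book'), (5, 'locked');
INSERT INTO TextWord VALUES (1, 1), (2, 2), (3, 4), (4, 3), (4, 5);
""")

rows = conn.execute("""
    SELECT DISTINCT T.TextId, T.Body
    FROM Word AS W
    JOIN TextWord AS I ON I.WordId = W.WordId
    JOIN Text AS T ON T.TextId = I.TextId
    WHERE damlev('document', W.Word) <= 2
    ORDER BY T.TextId
""").fetchall()
matched_ids = [text_id for text_id, _ in rows]
print(matched_ids)
```

A threshold of 5 is loose at the word level: "locked" can be turned into "document" in five edits, so it would count as a match. A smaller threshold, or one scaled to the length of the search term, may suit single words better.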

and a two-word match would look like this (off the top of my head, so it may not be exactly correct):

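SELECT T.* FROM Text AS T
JOIN (SELECT I.TextId,
             -- Count search terms matched, not words: each MAX() is 1 if any
             -- word in the text matches that term, so two "john" variants
             -- alone do not reach a count of 2.
             MAX(damlev('john', W.Word) <= 2)
           + MAX(damlev('smith', W.Word) <= 2) AS MatchCount
      FROM Word AS W
      JOIN `Intersect` AS I ON I.WordId = W.WordId
      WHERE damlev('john', W.Word) <= 2
            OR damlev('smith', W.Word) <= 2
      GROUP BY I.TextId) AS Matches ON Matches.TextId = T.TextId
          AND Matches.MatchCount = 2;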

The advantage here, at the cost of some database space, is that you only have to apply the expensive damlev function to the unique words, which will probably number only in the tens of thousands regardless of the size of your text table. This matters because the damlev UDF cannot use indexes: it scans the entire table it is applied to and computes a value for every row. Scanning just the unique words should be much faster. A second advantage is that damlev is applied at the word level, which seems to be what you are asking for. A third is that you can extend the query to search on multiple words, and rank the results by grouping the matching intersect rows on TextId and ordering by the count of matches.
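The ranking step can be sketched on its own. In this hypothetical example the fuzzy scan over the unique words is assumed to have already run, and its output is stored in a `Matched` table of (search term, matching word id) pairs. The table names, ids, and data are invented for illustration, and SQLite again stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TextWord (TextId INTEGER, WordId INTEGER);
-- Hypothetical intersect rows: text 1 contains words 10 and 20,
-- text 2 contains words 10 and 11, text 3 contains word 20.
INSERT INTO TextWord VALUES (1, 10), (1, 20), (2, 10), (2, 11), (3, 20);

-- Assumed result of the fuzzy scan: words 10 and 11 are close to 'john',
-- word 20 is close to 'smith'.
CREATE TABLE Matched (Term TEXT, WordId INTEGER);
INSERT INTO Matched VALUES ('john', 10), ('john', 11), ('smith', 20);
""")

# Group the matching intersect rows by text and rank by the number of
# distinct search terms each text matched.
ranked = conn.execute("""
    SELECT I.TextId, COUNT(DISTINCT M.Term) AS TermsMatched
    FROM Matched AS M
    JOIN TextWord AS I ON I.WordId = M.WordId
    GROUP BY I.TextId
    ORDER BY TermsMatched DESC, I.TextId
""").fetchall()
print(ranked)
```

Counting distinct terms rather than matching words keeps a text that contains two variants of "john" from outranking one that contains both "john" and "smith".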
