How to optimize this Levenshtein distance calculation

Question

Table a has around 8,000 rows and table b has around 250,000 rows. Without the levenshtein function the query takes just under 2 seconds. With the function included it is taking about 25 minutes.

SELECT
      *
   FROM
      library a,
      classifications b
   WHERE  
      a.`release_year` = b.`year`
      AND a.`id` IS NULL
      AND levenshtein_ratio(a.title, b.title) > 82

Recommended answer

I'm assuming that levenshtein_ratio is a function that you wrote (or maybe included from somewhere else). If so, the database server would not be able to optimize that in the normal sense of using an index. So it means that it simply needs to call it for each record that results from the other join conditions. With an inner join, that could be an extremely large number with those table sizes (a maximum of 8000*250000 = 2 billion). You can check the total number of times it would need to be called with this:

SELECT
      count(*)
   FROM
      library a,
      classifications b
   WHERE  
      a.`release_year` = b.`year`
      AND a.`id` IS NULL

That is an explanation of why it is slow (not really an answer to the question of how to optimize it). To optimize it, you likely need to add additional limiting factors to the join condition to reduce the number of calls to the user-defined function.
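
One possible limiting factor is a cheap length comparison, sketched below. It relies on an assumption the question does not confirm: that levenshtein_ratio(s1, s2) is computed as (1 - distance / length of the longer string) * 100. Under that definition the edit distance can never be smaller than the difference in length, so a ratio above 82 is only possible when the shorter title is more than 82% as long as the longer one. If your function normalizes differently, the bound would need to be re-derived.

```sql
-- Sketch only. Assumes (not confirmed by the original post) that
--   levenshtein_ratio(s1, s2) =
--     (1 - levenshtein(s1, s2) / GREATEST(CHAR_LENGTH(s1), CHAR_LENGTH(s2))) * 100
SELECT
      *
   FROM
      library a,
      classifications b
   WHERE
      a.`release_year` = b.`year`
      AND a.`id` IS NULL
      -- Cheap pre-filter: edit distance >= length difference, so a ratio
      -- above 82 requires shortest_length * 100 > longest_length * 82.
      AND LEAST(CHAR_LENGTH(a.title), CHAR_LENGTH(b.title)) * 100
          > GREATEST(CHAR_LENGTH(a.title), CHAR_LENGTH(b.title)) * 82
      AND levenshtein_ratio(a.title, b.title) > 82
```

The server is not guaranteed to evaluate the cheap condition before the function call, so it is worth timing the query, or running the count(*) query above with the extra condition added, to see how many function calls remain.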
