q-gram approximate matching optimisations


Question

I have a table containing 3 million person records on which I want to perform fuzzy matching using q-grams (on surname, for instance). I have created a table of 2-grams linked to it, but search performance is poor at this data volume (around 5 minutes).

I basically have two questions: (1) Can you suggest any ways to improve performance by avoiding a table scan (i.e. having to count the common q-grams between the search string and all 3 million surnames)? (2) With q-grams, if A is similar to B and C is similar to B, does that imply C is similar to A?

Kind regards

Peter

Recommended answer

I've been looking into fuzzy string matching lately, so even at the risk of answering an abandoned question, here goes. I hope you find this useful.

I suppose you're only interested in strings whose edit distance from the search string is no greater than a given value, and that your q-grams (or n-grams) look like this:

2-grams for "foobar": {"fo","oo","ob","ba","ar"}

  1. You could use positional q-grams:

"foobar": {("fo",1),("oo",2),("ob",3),("ba",4),("ar",5)}

The positional information can be used to determine whether a matching q-gram is really a "good match".
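The plain and positional 2-grams listed above can be generated with a few lines of Python. This is only a minimal sketch: the function names `qgrams` and `positional_qgrams` are made up for illustration, and positions are 1-based to match the example.

```python
def qgrams(s, q=2):
    """Return the overlapping q-grams of s, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def positional_qgrams(s, q=2):
    """Pair each q-gram with its 1-based start position in s."""
    return [(gram, pos) for pos, gram in enumerate(qgrams(s, q), start=1)]


print(qgrams("foobar"))
# ['fo', 'oo', 'ob', 'ba', 'ar']
print(positional_qgrams("foobar"))
# [('fo', 1), ('oo', 2), ('ob', 3), ('ba', 4), ('ar', 5)]
```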

For example, if you're searching for "foobar" with a maximum edit distance of 2, you're only interested in words where

2-gram "fo" occurs at a position from 1 to 3, or
2-gram "oo" occurs at a position from 1 to 4, or
... and so on

The string "barfoo" doesn't get any matches, because the positions of its otherwise matching 2-grams differ by 3.
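The position check can be sketched as follows. The helper `position_matches` is a hypothetical name, and it assumes a q-gram counts as a good match when its positions in the two strings differ by at most the maximum edit distance `d`.

```python
def positional_qgrams(s, q=2):
    """Return (q-gram, 1-based position) pairs for s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


def position_matches(query, candidate, d, q=2):
    """q-grams of candidate that occur in query at a position within d."""
    query_grams = positional_qgrams(query, q)
    return [
        (gram, pos)
        for gram, pos in positional_qgrams(candidate, q)
        if any(gram == qg and abs(pos - qpos) <= d for qg, qpos in query_grams)
    ]


print(position_matches("foobar", "barfoo", d=2))
# [] because every shared 2-gram is shifted by 3
print(position_matches("foobar", "fotocar", d=2))
# [('fo', 1), ('ar', 6)]
```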

Also, it might be useful to use the relation between the edit distance and the count of matching q-grams. The intuition is that since

a string s has len(s)-q+1 q-grams, and

a single edit operation can affect at most q q-grams,

we can deduce that

strings s1 and s2 within edit distance d have at least max(len(s1),len(s2))-q+1-q*d matching non-positional q-grams.

If you're searching for "foobar" with a maximum edit distance of 2, a matching 7-character string (such as "fotocar") should contain at least two common 2-grams: 7-2+1-2*2 = 2.
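As a rough sketch of this count filter in Python (the function names are illustrative, and shared q-grams are counted as a multiset intersection, which is one reasonable reading of "matching non-positional q-grams"):

```python
from collections import Counter


def qgrams(s, q=2):
    """Return the overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def min_common_qgrams(s1, s2, d, q=2):
    """Lower bound on shared q-grams for strings within edit distance d."""
    return max(len(s1), len(s2)) - q + 1 - q * d


def common_qgrams(s1, s2, q=2):
    """Count shared q-grams, respecting multiplicity."""
    shared = Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))
    return sum(shared.values())


def passes_count_filter(s1, s2, d, q=2):
    """True if s2 shares enough q-grams with s1 to remain a candidate."""
    return common_qgrams(s1, s2, q) >= min_common_qgrams(s1, s2, d, q)


print(min_common_qgrams("foobar", "fotocar", d=2))    # 2
print(common_qgrams("foobar", "fotocar"))             # 2 ("fo" and "ar")
print(passes_count_filter("foobar", "fotocar", d=2))  # True
print(passes_count_filter("foobar", "smith", d=2))    # False
```

The filter only discards strings that cannot be within the distance. Anything that passes still needs a real edit-distance check.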

See http://pages.stern.nyu.edu/~panos/publications/deb-dec2001.pdf for more details and some pseudo SQL.
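To relate these filters back to the table of 2-grams in the question, here is a small sketch using Python's sqlite3 module with an in-memory database. It is not the SQL from the linked paper. The table and column names (`person`, `qgram`, `search_gram`), the sample surnames and the index layout are all assumptions made for illustration. The query joins the search string's positional 2-grams against an indexed q-gram table, then applies the position, length and count conditions described above.

```python
import sqlite3


def positional_qgrams(s, q=2):
    """Return (q-gram, 1-based position) pairs for s."""
    return [(s[i:i + q], i + 1) for i in range(len(s) - q + 1)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, surname TEXT)")
conn.execute("CREATE TABLE qgram (person_id INTEGER, gram TEXT, pos INTEGER)")
# Hypothetical index so lookups by q-gram need not scan the whole table.
conn.execute("CREATE INDEX idx_qgram ON qgram (gram, pos)")

for pid, name in enumerate(["foobar", "fotocar", "barfoo", "smith"], start=1):
    conn.execute("INSERT INTO person VALUES (?, ?)", (pid, name))
    conn.executemany(
        "INSERT INTO qgram VALUES (?, ?, ?)",
        [(pid, gram, pos) for gram, pos in positional_qgrams(name)],
    )


def candidates(conn, query, d, q=2):
    """Surnames passing the position, length and count filters."""
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS search_gram (gram TEXT, pos INTEGER)"
    )
    conn.execute("DELETE FROM search_gram")
    conn.executemany(
        "INSERT INTO search_gram VALUES (?, ?)", positional_qgrams(query, q)
    )
    rows = conn.execute(
        """
        SELECT p.surname
        FROM search_gram s
        JOIN qgram g
          ON g.gram = s.gram
         AND g.pos BETWEEN s.pos - :d AND s.pos + :d      -- position filter
        JOIN person p ON p.id = g.person_id
        WHERE ABS(LENGTH(p.surname) - :n) <= :d           -- length filter
        GROUP BY p.id, p.surname
        HAVING COUNT(*) >= MAX(LENGTH(p.surname), :n) - :q + 1 - :q * :d
        ORDER BY p.surname
        """,
        {"d": d, "n": len(query), "q": q},
    ).fetchall()
    return [row[0] for row in rows]


print(candidates(conn, "foobar", d=2))
# ['foobar', 'fotocar']
```

The result is a candidate set only, so each candidate still needs an actual edit-distance check. Repeated q-grams can inflate `COUNT(*)`, which lets extra candidates through but does not lose true matches. When the count bound is zero or negative, strings sharing no q-grams would also qualify and are not found by this join. Whether the index helps enough on 3 million rows is something to measure rather than assume.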
