SQL模糊匹配 [英] SQL Fuzzy Matching

查看:154
本文介绍了SQL模糊匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

希望我不会重复这个问题。我在这里做了一些搜索和谷歌之前发布在这里。



我运行SQL Server 2008R2 eStore启用全文。



我的要求




  1. 有一个产品表,其中包含产品名称,OEM代码, 。全部都是文字。

  2. 我创建了一个名为TextSearch的新列。这将产品名称,OEM代码和型号的值链接起来,这些产品适用于这些值。这些值以逗号分隔。

  3. 当客户输入关键字时,我们在TextSearch列上运行搜索以匹配为产品。请参阅下面的匹配逻辑。

我正在使用Hybrid Fulltext和normal来进行搜索。这给出了更相关的结果。所有执行到临时表中的查询都会返回。



匹配逻辑,


  1. 运行以下SQL以使用全文获取相关产品。但@Keywords将被预处理。说'CLC 2200'将改为'CLC *和2200 *'



    SELECT ID FROM dbo.Product WHERE CONTAINS(TextSearch,@ Keywords)


  2. 另一个查询将使用正常运行。因此'CLC 2200'将被预处理为'TextSearch',如%clc%和TextSearch'%2200%'。这是因为全文搜索不会在关键字之前搜索模式。例如,它不会返回'pclc 2200'。



    SELECT ID FROM dbo.Product WHERE TextSearch like'%clc%'AND TextSearch like'%2200%'


  3. 如果第1步和第2步没有返回任何记录,则将执行以下搜索。值135是由我调整返回更多相关记录。



    SELECT p.id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(product,TextSearch,@ Keywords)AS r ON p.Id = r。[KEY] WHERE r.RANK> 135


以上所有组合均可正常工作在合理的速度和返回相关产品的关键字。



但我期待进一步改善,当没有产品发现。



说客户是否在寻找'CLC 2200npk'并且这个产品不在那里,我需要'CLC 2200'非常接近。



到目前为止,我尝试使用 Soundex()函数。购买TextSearch列中每个单词的计算soundex值,并与关键字的soudex值进行比较。但是,这会返回太多记录,并且速度也会变慢。

例如,'CLC 2200npk'将返回'CLC 1100'等产品。但这不会是好结果。因为它不接近CLC 2200npk



还有一个很好的此处。但是这使用CLR函数。但是我无法在服务器上安装CLR函数。



所以我的逻辑需要是



if 'CLC 2200npk'找不到'CLC 2200'
'CLC 2200'找不到'CLC 1100'


  1. 是否可以按照建议匹配?
  2. 如果我需要拼写更正和搜索,会是什么好方法?

  3. 是否有任何UDF或SP与我的建议相匹配的文本?

谢谢。

解决方案

一个相当快速的域特定解决方案可能是使用SOUNDEX和a两串之间的数字距离。这只会在你有很多产品代码时真正起作用。



使用下面这样简单的UDF,你可以从字符串中提取数字字符,然后你就可以得到'CLC 2200npk'为2200,'CLC 1100'为1100,因此您现在可以根据每个输入的SOUNDEX输出以及每个输入的数字分量的接近度来确定接近度。

  CREATE Function [dbo]。[ExtractNumeric](@ input VARCHAR(1000))
RETURNS INT
AS
BEGIN
WHILE PATINDEX('%[^ 0-9]%',@input)> 0
BEGIN
SET @input = STUFF(@input,PATINDEX('%[^ 0-9]%',@input),1,'')
END
IF @input =''OR @input IS NULL
SET @input ='0'
RETURN CAST(@input AS INT)
END
GO

就通用算法而言,有一对夫妇可以根据数据集大小和性能帮助您获得不同程度的成功要求。 (两个链接都有TSQL实现)





这里是一个有趣的文章,这两个算法适用于两个算法可能会给你一些想法。

那么希望有一些帮助。



编辑:这里是一个更快的部分Levenshtein距离实现(阅读后它不会返回与正常结果完全相同的结果)。在125000行的测试表上,它运行时间为6秒,而第一个链接的时间为60秒。


Hope i am not repeating this question. I did some search here and google before posting here.

I am running a eStore with SQL Server 2008R2 with Full Text enabled.

My requirements,

  1. There is a Product Table, which has product name, OEM Codes, Model which this product fits into. All are in text.
  2. I have created a new column called TextSearch. This has concatenated values of Product Name, OEM Code and Model which this product fits in. These values are comma separated.
  3. When a customer enters a keyword, we run search on TextSearch column to match for products. See matching logic below.

I am using a Hybrid Fulltext and normal like to do search. This gives more relevant results. All the queries executed in to a temp table and distincts were returned.

Matching logic,

  1. Run following SQL to get relevant product using full text. But @Keywords will be pre-processed. Say 'CLC 2200' will be changed to 'CLC* AND 2200*'

    SELECT Id FROM dbo.Product WHERE CONTAINS (TextSearch ,@Keywords)

  2. Another query will be running using normal like. So 'CLC 2200' will be pre-processed to 'TextSearch like %clc% AND TextSearch like %2200%'. This is simply because full text search wont search patterns before the keywords. example, it wont return 'pclc 2200'.

    SELECT Id FROM dbo.Product WHERE TextSearch like '%clc%' AND TextSearch like '%2200%'

  3. If step 1 and 2 didn't return any records, following search will be executed. Value 135 was fine tuned by me to return more relevant records.

    SELECT p.id FROM dbo.Product AS p INNER JOIN FREETEXTTABLE(product,TextSearch,@Keywords) AS r ON p.Id = r.[KEY] WHERE r.RANK > 135

All of above combined works fine in a reasonable speed and returns relevant products for keywords.

But i am looking for to further improve when there is no product found at all.

Say if customer looks for 'CLC 2200npk' and this product wasn't there, i needed to show next very close by 'CLC 2200'.

So far i tried using Soundex() function. Buy computing soundex value for each word in TextSearch column and comparing with soudex value of keyword. But this returns way too many records and slow too.

example, 'CLC 2200npk' will return products such as 'CLC 1100' etc. But this wouldn't be a good result. As it is not close to CLC 2200npk

There is another good one here. but this uses CLR Functions. But i can not install CLR functions on the server.

So my logic would need to be,

if 'CLC 2200npk' not found, show close by 'CLC 2200' if 'CLC 2200' not found, show next close by 'CLC 1100'

Questions

  1. Is it possible to match like as suggested?
  2. If i would need to do spelling correction and search, what would be good way? All of our product listing is in English.
  3. Is there any UDF or SP's to match texts like my suggestions?

Thanks.

解决方案

A rather quick domain specific solution may be to calculate a string similarity using SOUNDEX and a numeric distance between 2 strings. This will only really help when you have a lot of product codes.

Using a simple UDF like below you can extract the numeric chars from a string so that you can then get 2200 out of 'CLC 2200npk' and 1100 out of 'CLC 1100' so you can now determine closeness based on the SOUNDEX output of each input as well as closeness of the numeric component of each input.

CREATE Function [dbo].[ExtractNumeric](@input VARCHAR(1000))
RETURNS INT
AS
BEGIN
    WHILE PATINDEX('%[^0-9]%', @input) > 0
    BEGIN
        SET @input = STUFF(@input, PATINDEX('%[^0-9]%', @input), 1, '')
    END
    IF @input = '' OR @input IS NULL
        SET @input = '0'
    RETURN CAST(@input AS INT)
END
GO

As far as general purpose algorithms go there are a couple which might help you with varying degrees of success depending on data set size and performance requirements. (both links have TSQL implementations available)

  • Double Metaphone - This algo will give you a better match than soundex at the cost of speed it is really good for spelling correction though.
  • Levenshtein Distance - This will calculate how many keypresses it would take to turn one string into another for instance to get from 'CLC 2200npk' to 'CLC 2200' is 3, while from 'CLC 2200npk' to 'CLC 1100' is 5.

Here is an interesting article which applies both algos together which may give you a few ideas.

Well hopefully some of that helps a little.

EDIT: Here is a much faster partial Levenshtein Distance implementation (read the post it wont return exact same results as the normal one). On my test table of 125000 rows it runs in 6 seconds compared to 60 seconds for the first one I linked to.

这篇关于SQL模糊匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆