SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配 - 仅返回最低值 [英] SQL Function - Fuzzy Matching with Levenshtein Distance Algorithm - Return Lowest Value Only

查看:38
本文介绍了SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配 - 仅返回最低值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

问题:需要 SQL 函数使用 Levenshtein 算法返回最低"匹配值.

代码:

<预><代码>创建函数 ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))返回整数作为开始声明 @s1_len int, @s2_len int声明 @i int、@j int、@s1_char nchar、@c int、@c_temp int声明@cv0 varbinary(8000), @cv1 varbinary(8000)选择@s1_len = LEN(@s1),@s2_len = LEN(@s2),@cv1 = 0x0000,@j = 1,@i = 1,@c = 0而@j <= @s2_lenSELECT @cv1 = @cv1 + CAST(@j AS binary(2)),@j = @j + 1当@i <= @s1_len开始选择@s1_char = SUBSTRING(@s1, @i, 1),@c = @i,@cv0 = CAST(@i AS binary(2)),@j = 1当@j <= @s2_len开始设置@c = @c + 1SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +当@s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END如果@c >@c_temp SET @c = @c_tempSET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1如果@c >@c_temp SET @c = @c_tempSELECT @cv0 = @cv0 + CAST(@c AS binary(2)),@j = @j + 1结尾选择@cv1 = @cv0,@i = @i + 1结尾返回@c结尾如果 OBJECT_ID('tempdb..#ExistingCustomers') 不是 NULL删除表#ExistingCustomers;创建表#ExistingCustomers(客户 VARCHAR(255),身份证号码)INSERT #ExistingCustomers SELECT 'Ed''s Barbershop', 1002INSERT #ExistingCustomers SELECT 'GroceryTown', 1003INSERT #ExistingCustomers SELECT 'Candy Place', 1004INSERT #ExistingCustomers SELECT 'Handy Man', 1005如果 OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') 不是 NULL下降表#潜在客户;创建表#POTENTIALCUSTOMERS(客户VARCHAR(255));插入 #POTENTIALCUSTOMERS SELECT 'Eds Barbershop'INSERT #POTENTIALCUSTOMERS SELECT '杂货城'插入 #POTENTIALCUSTOMERS 选择糖果店"INSERT #POTENTIALCUSTOMERS SELECT 'Handee Man'插入 #POTENTIALCUSTOMERS 选择苹果农场"插入 #POTENTIALCUSTOMERS SELECT 'Ride-a-Long Bikes'选择 A. 客户,出价,b.客户作为客户,dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) 作为 ValueLev来自#POTENTIALCUSTOMERS aLEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) <15;

返回:

我想退货:

说明:结果是 Levenshtein 算法的最低"值.有两行 Levenshtein 分数相同 The Apple FarmRide-a-Long Bikes,在这种情况下,任何值都可以,只要它是一种价值.

参考资料:

SQL 模糊连接 - MSSQL

http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx

解决方案

如果您按潜在客户进行分区并使用 ValueLev 对结果进行排序,则可以使用 CTE 来获得您想要的结果:

;with CTE AS(SELECT RANK() OVER (PARTITION BY a.Customer ORDER BY dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) ASC) AS RowNbr,一个客户,出价,b.客户作为客户,dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) 作为 ValueLev来自#POTENTIALCUSTOMERS aLEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) <15)选择客户,MIN(ID) 作为 ID,MIN(cust) AS cust,价值等级从 CTE哪里 CTE.RowNbr = 1按客户分组,ValueLev

由于您不介意在重复 ValueLev 的情况下返回哪个结果,请使用 GROUP BYMIN 来缩放结果每个潜在客户最多一个.

输出:

Customer ID cust ValueLev糖果店 1004 糖果店 0杂货镇 1003 杂货镇 0Eds 理发店 1002 Ed 的理发店 1勤杂工 1005 勤杂工 2苹果农场 1004 Candy Place 9Ride-a-Long Bikes 1003 Candy Place 14

Problem: Need SQL function to return the 'lowest' matching value using the Levenshtein algorithm.

Code:


CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
 DECLARE @s1_len int, @s2_len int
 DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
 DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)

 SELECT
  @s1_len = LEN(@s1),
  @s2_len = LEN(@s2),
  @cv1 = 0x0000,
  @j = 1, @i = 1, @c = 0

 WHILE @j <= @s2_len
  SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1

 WHILE @i <= @s1_len
 BEGIN
  SELECT
   @s1_char = SUBSTRING(@s1, @i, 1),
   @c = @i,
   @cv0 = CAST(@i AS binary(2)),
   @j = 1

  WHILE @j <= @s2_len
  BEGIN
   SET @c = @c + 1
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
    CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
   IF @c > @c_temp SET @c = @c_temp
   SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
   IF @c > @c_temp SET @c = @c_temp
   SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
 END

 SELECT @cv1 = @cv0, @i = @i + 1
 END

 RETURN @c
END




IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
    DROP TABLE #ExistingCustomers;

    CREATE TABLE #ExistingCustomers
(
    Customer VARCHAR(255),
    ID INT
)

INSERT #ExistingCustomers SELECT 'Ed''s Barbershop',  1002
INSERT #ExistingCustomers SELECT 'GroceryTown',  1003
INSERT #ExistingCustomers SELECT 'Candy Place',  1004
INSERT #ExistingCustomers SELECT 'Handy Man',  1005



IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
    DROP TABLE #POTENTIALCUSTOMERS;

CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));

INSERT #POTENTIALCUSTOMERS SELECT 'Eds Barbershop'
INSERT #POTENTIALCUSTOMERS SELECT 'Grocery Town'
INSERT #POTENTIALCUSTOMERS SELECT 'Candy Place'
INSERT #POTENTIALCUSTOMERS SELECT 'Handee Man'
INSERT #POTENTIALCUSTOMERS SELECT 'The Apple Farm'
INSERT #POTENTIALCUSTOMERS SELECT 'Ride-a-Long Bikes'


SELECT A.Customer,
       b.ID,
       b.Customer as cust,
       dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) as ValueLev
FROM #POTENTIALCUSTOMERS a
     LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 15;

This returns:

What I would like to return:

Explanation: The results are the 'lowest' values from the Levenshtein algorithm. There are two rows where the Levenshtein scores are the same The Apple Farm and Ride-a-Long Bikes, in which case any of the values is fine, just as long as it is one value.

References:

SQL Fuzzy Join - MSSQL

http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx

解决方案

You can use CTE to get the result you want if you partition by the potential customer and use the ValueLev to order the results:

;WITH CTE AS
(
    SELECT  RANK() OVER (PARTITION BY a.Customer ORDER BY dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) ASC) AS RowNbr,
            A.Customer,
            b.ID,
            b.Customer as cust,
            dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) as ValueLev
      FROM  #POTENTIALCUSTOMERS a
        LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 15
)
SELECT  Customer,
        MIN(ID) AS ID,
        MIN(cust) AS cust,
        ValueLev
  FROM  CTE
  WHERE CTE.RowNbr = 1
  GROUP BY Customer, ValueLev

As you don't mind which result is returned in the case of duplicate ValueLev, use GROUP BY and MIN to scale the results down to one per potential customer.

Output:

Customer            ID      cust            ValueLev
Candy Place         1004    Candy Place     0
Grocery Town        1003    GroceryTown     0
Eds Barbershop      1002    Ed's Barbershop 1
Handee Man          1005    Handy Man       2
The Apple Farm      1004    Candy Place     9
Ride-a-Long Bikes   1003    Candy Place     14

这篇关于SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配 - 仅返回最低值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆