Get percentage of matching strings

Problem description

I have two strings to compare and want to get the percentage to which they match.

Given:

String 1: John Smith Makde

String 2: Makde John Smith   

I used the following user-defined scalar function:

CREATE FUNCTION [dbo].[udf_GetPercentageOfTwoStringMatching]
(
    @string1 NVARCHAR(1000)
    ,@string2 NVARCHAR(1000)
)
RETURNS INT

--WITH ENCRYPTION 
AS
BEGIN

    DECLARE @levenShteinNumber INT

    DECLARE @string1Length INT = LEN(@string1), @string2Length INT = LEN(@string2)
    DECLARE @maxLengthNumber INT = CASE WHEN @string1Length > @string2Length THEN @string1Length ELSE @string2Length END

    SELECT @levenShteinNumber = [dbo].[f_ALGORITHM_LEVENSHTEIN] (@string1  ,@string2)

    DECLARE @percentageOfBadCharacters INT = @levenShteinNumber * 100 / @maxLengthNumber

    DECLARE @percentageOfGoodCharacters INT = 100 - @percentageOfBadCharacters

    -- Return the result of the function
    RETURN @percentageOfGoodCharacters

END

Function call:

SELECT dbo.udf_GetPercentageOfTwoStringMatching('John Smith Makde','Makde John Smith')

Output:

7
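
The 7 can be traced through the integer arithmetic in the function. Both strings are 16 characters long, so the function as shown can only return 7 if [dbo].[f_ALGORITHM_LEVENSHTEIN] reported a distance of 15. That value is inferred from the reported output, not measured here. A minimal sketch of the calculation under that assumption:

```sql
-- Assumption: the Levenshtein helper returned 15 for this pair.
-- The value is inferred from the reported output of 7, not measured.
DECLARE @levenShteinNumber INT = 15;
DECLARE @maxLengthNumber INT = LEN('John Smith Makde');  -- 16 characters

-- Integer division truncates: 15 * 100 / 16 = 1500 / 16 = 93
SELECT 100 - @levenShteinNumber * 100 / @maxLengthNumber AS PercentageOfGoodCharacters;  -- 7
```

An edit distance counts single-character insertions, deletions and substitutions, so moving a whole word is expensive even though no word has changed.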

But when I pass the same string twice, with the words in the same order:

SELECT dbo.udf_GetPercentageOfTwoStringMatching('John Smith Makde','John Smith Makde')

Output:

100

Expected result: since both strings contain the same words, only in a different order, I want a matching percentage of 100%.

100

Recommended answer

Doing this for millions of rows again and again will be a nightmare. I'd add another column (or a 1:1 related side table) to permanently store a normalized string. Try this:

--Create a mockup table and fill it with some dummy data

CREATE TABLE #MockUpYourTable(ID INT IDENTITY, SomeName VARCHAR(1000));
INSERT INTO #MockUpYourTable VALUES('Makde John Smith')
                                  ,('Smith John Makde')
                                  ,('Some other string')
                                  ,('string with with duplicates with');
GO

--Add a column to store the normalized strings

ALTER TABLE #MockUpYourTable ADD NormalizedName VARCHAR(1000);
GO

--Use this script to split each string into fragments and re-concatenate them as a canonically ordered, duplicate-free string.

UPDATE #MockUpYourTable SET NormalizedName=CAST('<x>' + REPLACE((SELECT LOWER(SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
                                            .query(N'
                                                    for $fragment in distinct-values(/x/text())
                                                    order by $fragment
                                                    return $fragment
                                                    ').value('.','nvarchar(1000)');           
GO

--Check the result

SELECT * FROM #MockUpYourTable

ID  SomeName                            NormalizedName
----------------------------------------------------------
1   Makde John Smith                    john makde smith
2   Smith John Makde                    john makde smith
3   Some other string                   other some string
4   string with with duplicates with    duplicates string with

--Clean-Up
GO
DROP TABLE #MockUpYourTable

Hint: use a trigger ON INSERT, UPDATE to keep these values in sync.
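
A trigger along these lines might look like the sketch below. Triggers cannot be created on temporary tables, so it assumes a permanent table; the names dbo.YourTable and trg_YourTable_Normalize are hypothetical and not part of the original answer. The normalization expression is the one from the UPDATE script above.

```sql
-- Hypothetical permanent table; triggers cannot be defined on #temp tables.
CREATE TABLE dbo.YourTable
(
    ID INT IDENTITY PRIMARY KEY,
    SomeName VARCHAR(1000),
    NormalizedName VARCHAR(1000)
);
GO

CREATE TRIGGER dbo.trg_YourTable_Normalize   -- hypothetical trigger name
ON dbo.YourTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Skip the work when SomeName was not written by the statement.
    -- This also stops the UPDATE below from re-triggering the logic.
    IF NOT UPDATE(SomeName) RETURN;

    -- Recompute the normalized string only for the affected rows
    UPDATE t
    SET NormalizedName = CAST('<x>' + REPLACE((SELECT LOWER(i.SomeName) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML)
                         .query(N'
                                 for $fragment in distinct-values(/x/text())
                                 order by $fragment
                                 return $fragment
                                 ').value('.','nvarchar(1000)')
    FROM dbo.YourTable AS t
    INNER JOIN inserted AS i ON i.ID = t.ID;
END
GO
```

Joining on the inserted pseudo-table keeps the trigger set-based, so multi-row inserts and updates are handled in one statement.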

Now you can apply the same transformation to the strings you want to compare and then use your former approach. Because the fragments are re-sorted, strings made up of identical fragments will return 100% similarity.
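
One way to reuse the transformation is to wrap it in a scalar function and normalize both arguments before calling the percentage function from the question. The sketch below assumes that approach; the helper name dbo.udf_NormalizeWords is hypothetical, and the percentage function is referenced by the name used in its CREATE FUNCTION statement.

```sql
-- Hypothetical helper wrapping the normalization expression from the answer
CREATE FUNCTION dbo.udf_NormalizeWords (@input NVARCHAR(1000))
RETURNS NVARCHAR(1000)
AS
BEGIN
    -- Lower-case the input, escape it for XML and split it on spaces into <x> elements
    DECLARE @fragments XML =
        CAST('<x>' + REPLACE((SELECT LOWER(@input) AS [*] FOR XML PATH('')),' ','</x><x>') + '</x>' AS XML);

    -- Re-concatenate the distinct fragments in sorted order
    RETURN @fragments.query(N'
                             for $fragment in distinct-values(/x/text())
                             order by $fragment
                             return $fragment
                             ').value('.','nvarchar(1000)');
END
GO

-- Both arguments should normalize to 'john makde smith' before they are compared
SELECT dbo.udf_GetPercentageOfTwoStringMatching(
           dbo.udf_NormalizeWords('John Smith Makde'),
           dbo.udf_NormalizeWords('Makde John Smith')) AS MatchPercentage;
```

Since the normalized strings are identical, this call should behave like the identical-string case in the question, which returned 100. For large tables, comparing against the stored NormalizedName column avoids re-normalizing every row on each query.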
