根据人员姓名匹配记录 [英] Matching records based on Person Name

查看:164
本文介绍了根据人员姓名匹配记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在两个不同的数据源之间,是否有任何工具或方法可用于按人的名字进行匹配?

Are there any tools or methods that can be used for matching by a person's name between two different data sources?

系统没有其他公共信息,并且在许多情况下输入的名称不同.

The systems have no other common information and the names have been entered differently in many cases.

不完全匹配的示例:

小国王马丁·路德=马丁国王(不包括后缀)
Erving,J.博士= Erving,J.(不包括前缀)
奥巴马,巴拉克·侯赛因=奥巴马,巴拉克(不包括中间名)
Pufnstuf,H.R. = Pufnstuf,Haibane Renmei(比赛缩写)
Tankengine,托马斯= Tankengine,汤姆(匹配常用昵称)
Flair,Rick"The Natureboy" = Flair,Natureboy(昵称匹配)

King Jr., Martin Luther = King, Martin (exclude suffix)
Erving, Dr. J. = Erving, J. (exclude prefix)
Obama, Barak Hussein = Obama, Barak (exclude middle name)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei (match abbreviations)
Tankengine, Thomas = Tankengine, Tom (match common nicknames)
Flair, Rick "the Natureboy" = Flair, Natureboy (match on nickname)

推荐答案

我不得不使用各种建议的技术.感谢为我指出正确的方向.希望以下内容可以帮助其他人解决此类问题.

I had to use a variety of techniques suggested. Thanks pointing me in the right direction(s). Hopefully, the following will help someone else out with this type of problem to solve.

删除多余的字符

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
    @String NVARCHAR(MAX), 
    @MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    SET @MatchExpression =  '%['+@MatchExpression+']%'

    WHILE PatIndex(@MatchExpression, @String) > 0
        SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

    RETURN @String

END

用法:

--remove all non-alphanumeric and non-white space  
dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')  

将名称分为几部分

CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE 
    (
        [ID] VARCHAR(MAX),
        [Val] VARCHAR(MAX)
    )
AS
BEGIN

declare @OuterCursor cursor
declare @ID varchar(max)
declare @Val varchar(max)

set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY

open @OuterCursor

fetch next from @OuterCursor into @ID, @Val

while (@@FETCH_STATUS=0)
begin

    INSERT INTO @ResultList (ID, Val)   
    select @ID, split.s from dbo.Split(@sep, @Val) as split 
           where len(split.s) > 0

    fetch next from @OuterCursor into @ID, @Val
end

close @OuterCursor
deallocate @OuterCursor 

CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
    WITH Pieces(pn, start, stop) AS (
      SELECT 1, 1, CHARINDEX(@sep, @s)
      UNION ALL
      SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
      FROM Pieces
      WHERE stop > 0
    )
    SELECT pn,
      LTRIM(RTRIM(SUBSTRING(@s, start, 
             CASE WHEN stop > 0 
                  THEN stop-start 
                  ELSE 8000 
             END))) AS s
    FROM Pieces
  )

RETURN

用法:

--create split name list
DECLARE @NameList StringList 

INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0 

----remove split dups
select u.ID, COUNT(*)
from dbo.import_SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id

常用昵称:

我创建了一个基于此列表的表,并使用了该表以通用名称进行连接等价物.

I created a table based on this list and used it to join on common name equivalents.

用法:

SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID from
FROM dbo.[User] u 
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
    SELECT c.id, c.LastName, c.FirstName, 
         c_nickname_maybe.Name AS MaybeFirstName
    FROM dbo.Contact c
    INNER JOIN nickname c_nickname_match
    ON c.FirstName = c_nickname_match.Name
    INNER JOIN nickname c_nickname_maybe
    ON c_nickname_match.relatedid = c_nickname_maybe.id
    WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) as c
ON c.AccountHolderID = ah.ID 
       AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name

语音算法(Jaro Winkler):

惊人的文章,超越SoundEx-MS SQL Server中的模糊搜索功能,说明如何在SQL Server中安装和使用 SimMetrics 库.该库使您可以找到字符串之间的相对相似性,并包括许多算法.我最终大多使用 Jaro Winkler 来匹配名称.

The amazing article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms. I ended up mostly using Jaro Winkler to match the names.

用法:

SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName 
,u.LastName
,c.LastName
,maxResult.CombinedScores
 from
(
    SELECT
      u.ID
    , 
        max(
            dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))  
            * dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
        ) AS CombinedScores
    FROM dbo.[User] u, dbo.[Contact] c
    WHERE u.ContactID IS NULL
    GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id  = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores = 
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName)) 
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))

这篇关于根据人员姓名匹配记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆