在SQL 2008R2中规范化人名 [英] Normalizing people names in SQL 2008R2

查看:87
本文介绍了在SQL 2008R2中规范化人名的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个带有全名字段的数据库,其中包含以下内容:

I have a database with a "full name" field that can have the following content:

SMITH, J
SMITH, JOHN
SMITH, JOHN M
SMITH, JOHN MATHEW
SMITH, JOHN MATHEW MAJ
SMITH, J M MAJ
SMITH,J
SMITH,JOHN
JOHN M SMITH
JOHN MATHEW SMITH
JOHN MATHEW SMITH MAJ
SMITH J



在另一个数据库中,有一个全名域将包含以下两种格式中的一种(但只有其中一种)


In another database, there is a full name field that will contain one of the two following formats (but only one of them)

SMITH,JOHN
SMITH,JOHN M
SMITH,JOHN MATHEW



我设法在第一个数据库中规范化各种格式,这样至少每件事都是这样的:


I've managed to "normalize" the various formats in the first database so that at least every thing looks like this:

SMITH,J
SMITH,JOHN
SMITH,JOHN M
SMITH,J M



和第二个数据库中的所有内容对此(再次,它将是一个或另一个取决于该人是否在该表中指定了中间名):


and everything in the 2nd database to this (again, it will be one or the other depending on if the person has a middle name specified in that table):

SMITH,JOHN M
SMITH,JOHN



当我在全名字段上加入表格时,我得到了高度匹配,但与此同时,还有一定数量的不匹配(正如你猜的那样)。



我的新问题是这个 - 尝试自动确定如何按摩所有(或尽可能多)的最佳方法是什么第一个表中的名称与第二个表中使用的版本相同。



我认为可以按摩的唯一名称是那些最接近第二个表中名字的那个,但我怎么能在SQL中做到这一点?



如果他们(这个匹配的东西会更容易)导入数据的来源)包括数据中可用的数字personID字段(存在于第二个表中),但它们不是,我们还没有发现它们是否可以。如果他们可以做到的话,所有这些工作都将是徒劳的。



就我到目前为止,它需要一个相当大的SQL来规范化名称,这涉及 REPLACE的大量使用 TRIM LEFT RIGHT REVERSE ,适当的洒水 CASE WHEN 以获得良好的衡量标准。



我尝试了什么:



我还没有尝试任何东西,因为我甚至没有知道是否有可能做我需要做的事情,因为它听起来更像是模糊逻辑问题而不是其他任何问题。我厌恶根据一些任意规则集进行猜测的想法,这些规则集在每次新数据导入时都会变得无用。


When I join the tables on full name fields, I get a high degree of matches, but at the same time, there are a moderately significant number of non-matches (as you might guess).

My new problem is this - what is the best way to try and automagically determine how to massage all (or as many as possible) of the names in the first table to the version being used in the 2nd table.

The way I have it figured is that the only names that could possibly be massaged are the ones that are closest to the name in the 2nd table, but how would I do that in SQL?

This matching stuff would be a whole lot easier if they (the source of the imported data) included an available numeric personID field (that exists in the 2nd table) in the data, but they aren't, and we haven't yet found out if they can. If they can and do, all of this work will have been for naught.

It required a SIGNIFICANT amount of SQL to normalize the names as far as I've gotten to this point, which involved numerous uses of REPLACE, TRIM, LEFT, RIGHT, and REVERSE, with a moderate sprinkling of CASE WHEN for good measure.

What I have tried:

I haven't tried anything yet because I don't even know if it's possible to do what I need to do, because it sounds more like a fuzzy logic problem than anything else. I abhor the thought of guessing based on some arbitrary ruleset that could be rendered useless with every new data import.

推荐答案

嗨John,

Hi John,
引用:

因为它听起来更像模糊逻辑问题而不是其他任何东西。

because it sounds more like a fuzzy logic problem than anything else.



我同意你有一定程度的猜测,如果你不知道确切的真实姓名没有重复或几乎重复,那么100%匹配是不可能的。

问题:

- 2个人可以拥有完全相同的名字。

- 2个不同名字的人可以写成相同的名字。 SMITH,JOHN MATHEW和SMITH,JOHN MAX可以导致SMITH,JOHN M无法区分。



我认为使用的问题没有100%准确的解决方案仅限名称。


I agree with you, there is a degree of guessing, 100% match is impossible if not on a short list where you know exact real names without duplicates or almost duplicates.
Problems:
- 2 individuals can have the exact same name.
- 2 individuals with different names can be written the same. SMITH, JOHN MATHEW and SMITH, JOHN MAX can lead to SMITH, JOHN M and be undistinguishable.

I think there is no 100% accurate solution for your problem by using the name only.


这篇关于在SQL 2008R2中规范化人名的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆