在SQL 2008R2中规范化人名 [英] Normalizing people names in SQL 2008R2

查看：87 发布时间：2019/6/7 22:54:32 SQL SQL-server-2008R2

本文介绍了在SQL 2008R2中规范化人名的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我有一个带有全名字段的数据库，其中包含以下内容：

I have a database with a "full name" field that can have the following content:

SMITH, J
SMITH, JOHN
SMITH, JOHN M
SMITH, JOHN MATHEW
SMITH, JOHN MATHEW MAJ
SMITH, J M MAJ
SMITH,J
SMITH,JOHN
JOHN M SMITH
JOHN MATHEW SMITH
JOHN MATHEW SMITH MAJ
SMITH J

在另一个数据库中，有一个全名域将包含以下两种格式中的一种（但只有其中一种）

In another database, there is a full name field that will contain one of the two following formats (but only one of them)

SMITH,JOHN
SMITH,JOHN M
SMITH,JOHN MATHEW

我设法在第一个数据库中规范化各种格式，这样至少每件事都是这样的：

I've managed to "normalize" the various formats in the first database so that at least every thing looks like this:

SMITH,J
SMITH,JOHN
SMITH,JOHN M
SMITH,J M

和第二个数据库中的所有内容对此（再次，它将是一个或另一个取决于该人是否在该表中指定了中间名）：

and everything in the 2nd database to this (again, it will be one or the other depending on if the person has a middle name specified in that table):

SMITH,JOHN M
SMITH,JOHN

当我在全名字段上加入表格时，我得到了高度匹配，但与此同时，还有一定数量的不匹配（正如你猜的那样）。

我的新问题是这个 - 尝试自动确定如何按摩所有（或尽可能多）的最佳方法是什么第一个表中的名称与第二个表中使用的版本相同。

我认为可以按摩的唯一名称是那些最接近第二个表中名字的那个，但我怎么能在SQL中做到这一点？

如果他们（这个匹配的东西会更容易）导入数据的来源）包括数据中可用的数字personID字段（存在于第二个表中），但它们不是，我们还没有发现它们是否可以。如果他们可以做到的话，所有这些工作都将是徒劳的。

就我到目前为止，它需要一个相当大的SQL来规范化名称，这涉及 REPLACE的大量使用， TRIM ， LEFT ， RIGHT 和 REVERSE ，适当的洒水 CASE WHEN 以获得良好的衡量标准。

我尝试了什么：

我还没有尝试任何东西，因为我甚至没有知道是否有可能做我需要做的事情，因为它听起来更像是模糊逻辑问题而不是其他任何问题。我厌恶根据一些任意规则集进行猜测的想法，这些规则集在每次新数据导入时都会变得无用。

When I join the tables on full name fields, I get a high degree of matches, but at the same time, there are a moderately significant number of non-matches (as you might guess).

My new problem is this - what is the best way to try and automagically determine how to massage all (or as many as possible) of the names in the first table to the version being used in the 2nd table.

The way I have it figured is that the only names that could possibly be massaged are the ones that are closest to the name in the 2nd table, but how would I do that in SQL?

This matching stuff would be a whole lot easier if they (the source of the imported data) included an available numeric personID field (that exists in the 2nd table) in the data, but they aren't, and we haven't yet found out if they can. If they can and do, all of this work will have been for naught.

It required a SIGNIFICANT amount of SQL to normalize the names as far as I've gotten to this point, which involved numerous uses of REPLACE, TRIM, LEFT, RIGHT, and REVERSE, with a moderate sprinkling of CASE WHEN for good measure.

What I have tried:

I haven't tried anything yet because I don't even know if it's possible to do what I need to do, because it sounds more like a fuzzy logic problem than anything else. I abhor the thought of guessing based on some arbitrary ruleset that could be rendered useless with every new data import.

在SQL 2008R2中规范化人名 [英] Normalizing people names in SQL 2008R2

问题描述

推荐答案

相关文章

其他开发语言最新文章

热门教程

热门工具

登录关闭

在SQL 2008R2中规范化人名 [英] Normalizing people names in SQL 2008R2

问题描述

推荐答案

相关文章

其他开发语言最新文章

热门教程

热门工具

登录 关闭

登录关闭