Data Cleanup, post conversion from ALLCAPS to Title Case


Problem description


Converting a database of people and addresses from ALL CAPS to Title Case creates a number of improperly capitalized words and names. Some examples of values that should end up looking like this:

MacDonald, PhD, CPA, III

Does anyone know of an existing script that will clean up all the common problem words? Certainly, it will still leave some mistakes behind (less common names with CamelCase-like spellings, e.g. "MacDonalz").

I don't think it matters much, but the data currently resides in MSSQL. Since this is a one-time job, I'd export out to text if a solution requires it.

There is a thread that poses a related question and sometimes touches on this problem, but does not address it specifically. You can see it here:

SQL Server: Make all UPPER case to Proper Case/Title Case

Solution

Here is the answer I was looking for:

There is a data company, Melissa Data, that publishes some APIs and applications for database cleanup, geared mostly toward the direct marketing industry.

I was able to use two applications to solve my problem.

  1. StyleList: this app, among other things, converts ALL CAPS to mixed case, and in the process it does not dirty up the data: it leaves titles such as CPA, MD, III, etc. intact, as well as natural, common camel-case names such as McDonalds.
  2. Personator: I used Personator to break the Full Name fields into Prefix, First Name, Middle Name, Last Name, and Suffix. To be honest, it was far from perfect, but the data I gave it was pretty challenging (often no space separating a middle name and a suffix). This app does a number of other useful things as well, including assigning gender to most names. It's available as an API you can call, too.
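
For readers without access to those tools, the general idea of case conversion with an exception list can be sketched in a few lines of Python. This is only an illustration, not how StyleList works internally: the `KEEP_UPPER` and `FIXED_CASE` lists and the `smart_title` helper are hypothetical and would need to be extended for real data.

```python
import re

# Hypothetical exception lists (assumptions, not from any vendor tool).
KEEP_UPPER = {"CPA", "MD", "II", "III", "IV"}   # tokens that stay fully upper case
FIXED_CASE = {"PHD": "PhD"}                     # tokens with one known mixed-case form

# "Mc" is usually safe to re-capitalize. "Mac" is deliberately left out
# because it would also hit names such as Macy or Mack.
MC_PREFIX = re.compile(r"^Mc([a-z])")

def fix_word(word):
    """Return the preferred casing for a single alphabetic token."""
    upper = word.upper()
    if upper in KEEP_UPPER:
        return upper
    if upper in FIXED_CASE:
        return FIXED_CASE[upper]
    titled = word.capitalize()                  # "MCDONALD" -> "Mcdonald"
    return MC_PREFIX.sub(lambda m: "Mc" + m.group(1).upper(), titled)

def smart_title(text):
    """Title-case every run of letters, leaving punctuation and spacing alone."""
    return re.sub(r"[A-Za-z]+", lambda m: fix_word(m.group(0)), text)

print(smart_title("JOHN MCDONALD, PHD, CPA, III"))
```

Because each run of letters is handled separately, a value like O'BRIEN comes out as O'Brien, but a possessive such as JOHN'S would come out as John'S, so a sketch like this still leaves residue to clean up by hand.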

Here is a link to the solutions offered by Melissa Data:

http://www.melissadata.com/dqt/index.htm

For me, the Melissa Data apps did much of the heavy lifting. The remaining dirty data was identifiable and fixable in SQL by reporting on counts of the LEFT x or RIGHT x characters of each value: the dirt typically has the least uniqueness, so its patterns are easily discovered and fixed.
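
If the data has been exported to text, the same LEFT x / RIGHT x counting can be approximated outside the database. The sketch below assumes the values are already loaded into a Python list; `edge_counts` is a hypothetical helper, and it does roughly what grouping by `RIGHT(column, x)` or `LEFT(column, x)` and counting would do in SQL.

```python
from collections import Counter

def edge_counts(values, x, side="right"):
    """Count how often each x-character ending (or beginning) occurs.

    Assumes x >= 1. Values shorter than x are counted under their full text.
    """
    counts = Counter()
    for value in values:
        value = value.strip()
        key = value[-x:] if side == "right" else value[:x]
        counts[key] += 1
    # Most frequent edges first: repeated suffixes such as a mangled "Iii"
    # rise to the top, where they are easy to spot.
    return counts.most_common()

# Made-up sample values, for illustration only.
names = ["Smith Iii", "Jones Iii", "Brown Phd", "Lee Iii", "Adams"]
for edge, count in edge_counts(names, 3):
    print(edge, count)
```

Scanning the top of that report for a few values of x is usually enough to find the handful of patterns worth fixing with a targeted update.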
