How do I normalize a large, user-generated data set of company names?

Question

Use case: User 1 uploads 100 company names (e.g. Microsoft, Bank of Sierra)

User 2 uploads 100 company names (e.g. The Gap, Uservoice, Microsoft, Inc.)

I want User 1's notion of Microsoft and User 2's notion of Microsoft to map to a centrally maintained entity with a unique index for Microsoft.

If someone uploads a name which isn't in the central repository, I guess I'd like it to be entered as is. But then what happens if that first entry is incorrectly spelled (e.g. Vergin Mobile instead of Virgin Mobile)? How can we best correct it and correlate new uploads to that same index?

Technically, should the central repository be a separate database altogether? Should the user-generated information also be kept in a separate database from the business transactions that will occur against it?

Starting out with a large definition of the problem and hoping to chunk it up with your input, thanks.

Recommended answer

FWIW, this has nothing to do with database normalization. This is a data cleanup task.

Data cleanup cannot be fully automated in the general case. Many people try, but it's impossible to detect all the ways that the input data might be malformed. You can automate some percentage of the cases with techniques such as:


  • Force users to select company names from a list instead of typing them. Of course this is best for single entries, not for bulk uploads.
  • Compare the SOUNDEX of the input company names to the SOUNDEX of company names already in the database. This is useful for identifying possible matches, but it can also give false positives. So you need a human to review them.
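
The SOUNDEX comparison in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the answer's own code: it assumes the classic American Soundex rules, and the helper names `soundex` and `soundex_candidates` are hypothetical. Most SQL databases ship a built-in SOUNDEX function whose details may differ slightly from this one.

```python
# Letter groups used by classic American Soundex (assumed variant).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of a name ("" if it has no A-Z letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def soundex_candidates(new_name: str, known_names: list[str]) -> list[str]:
    """Return known names whose Soundex key equals that of new_name.

    These are only *possible* matches; a human still has to review them.
    """
    key = soundex(new_name)
    return [n for n in known_names if soundex(n) == key]
```

With this sketch, "Vergin Mobile" and "Virgin Mobile" produce the same key, so the misspelled upload would be flagged as a possible match. The key only covers the first few consonants, though, so unrelated names such as "Microsoft" and "Macrosoft" also collide, which is the false-positive problem the bullet warns about.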

Ultimately, you need to design your software to make it easy for an administrator to "merge" entries (and update any references from other database tables) as they are discovered to be duplicates of one another. There's no elegant way to do this with cascading foreign keys; you just have to write a bunch of UPDATE statements.
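
A merge of that kind might look something like the sketch below, using Python's `sqlite3` module. The schema is an assumption made for illustration: a `companies` table with an `id` primary key, and an `uploads` table whose `company_id` column references it. The function name `merge_companies` and the `REFERENCING_COLUMNS` list are hypothetical as well.

```python
import sqlite3

# Hypothetical list of (table, column) pairs that reference companies.id.
# A real application would list every referencing column here.
REFERENCING_COLUMNS = [("uploads", "company_id")]


def merge_companies(conn: sqlite3.Connection, keep_id: int, duplicate_id: int) -> None:
    """Repoint all references from duplicate_id to keep_id, then drop the duplicate."""
    with conn:  # one transaction: commits on success, rolls back on an exception
        for table, column in REFERENCING_COLUMNS:
            # Table and column names come from the fixed list above,
            # never from user input, so interpolating them is safe here.
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE {column} = ?",
                (keep_id, duplicate_id),
            )
        conn.execute("DELETE FROM companies WHERE id = ?", (duplicate_id,))
```

Running the updates and the delete in a single transaction means a failure partway through leaves both entries intact instead of leaving references that point at a deleted row.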
