Finding similar contact names within a table


Question

I am performing a data cleanup, and one of my tasks is to delete similar duplicate contacts.

Example:

BILL CROSBIE, BILL CROSBY, BILL CROSSBY; or KRISTEN HARRIS, KRISTIN HARIS. 

There is no exact rule, but from manually scanning the data I can tell that these names are very similar and must be duplicates.

Can anyone provide an example of how I can do this using SSIS?

I understand that I can use the Fuzzy Lookup, but it requires a reference table of correct data, which is then compared to the table that needs cleanup. Is it possible instead to use the Script Component in SSIS with an algorithm that finds the names with the most matching characters? What would that C# code look like?

I am new to SSIS and don't have much experience. Alternatively, is there some sort of script I can create in MSSQL that can do this?

Recommended answer

I would use the SSIS Fuzzy Lookup component. I would use your Contacts table as the reference input and store the new index (effectively creating an output table). On the component's Advanced page, I would allow multiple matches per lookup and reduce the similarity threshold.
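To make the idea of matching a table against itself more concrete, here is a minimal Python sketch. It is not SSIS code and does not reproduce whatever scoring Fuzzy Lookup uses internally; it assumes the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure, and the function names and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score (difflib's ratio, case-insensitive)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def candidate_pairs(names, threshold=0.8):
    """Compare every name with every other name and keep the close pairs.

    The threshold is an assumed starting point; tune it against your data.
    """
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    # Highest-scoring (most likely duplicate) pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


contacts = [
    "BILL CROSBIE", "BILL CROSBY", "BILL CROSSBY",
    "KRISTEN HARRIS", "KRISTIN HARIS",
]
for a, b, score in candidate_pairs(contacts):
    print(f"{score:.3f}  {a}  <->  {b}")
```

Comparing every pair is quadratic in the number of rows, so this brute-force approach only suits small tables; avoiding that cost is one reason to prefer the indexed Fuzzy Lookup on real data.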

After executing, I would query the new index table, examining the similarity and confidence scores. Scores above a certain threshold (which depends on your data) would indicate a duplicate.
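Once scored pairs are available, one possible way to act on them is to keep the pairs above the threshold and merge them into duplicate groups. The Python sketch below assumes each output row can be reduced to a hypothetical `(name, matched_name, similarity)` tuple; the scores shown are invented sample values, not real Fuzzy Lookup output.

```python
from collections import defaultdict


def duplicate_groups(scored_pairs, threshold):
    """Group names linked by pairs scoring at or above the threshold.

    Uses a small union-find so that A~B and B~C end up in one group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in list(parent):
        groups[find(name)].append(name)
    # Only groups with more than one member are duplicates.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Invented sample rows: (name, matched_name, similarity)
rows = [
    ("BILL CROSBIE", "BILL CROSBY", 0.87),
    ("BILL CROSBY", "BILL CROSSBY", 0.96),
    ("KRISTEN HARRIS", "KRISTIN HARIS", 0.89),
    ("BILL CROSBY", "KRISTEN HARRIS", 0.21),
]
for group in duplicate_groups(rows, threshold=0.8):
    print(group)
```

Because the grouping is transitive, a long chain of near matches can pull fairly different names into the same group, so it is worth reviewing the groups by hand before deleting anything.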
