Help me de-duplicate a provided dataset

Problem description

De-duplicate the provided dataset

Records will be considered duplicates if they meet all of the following conditions:

a. Last Name exact match
b. First Name fuzzy / similar match (points for creativity here)
c. Any exact match of one or more of the following:
1. Email Address
2. Full mailing Address
3. Phone Number
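
The three conditions above could be sketched as a single predicate. This is only a minimal sketch, not a definitive implementation: it assumes records are dictionaries keyed by the column names from the CSV file, that "exact" comparisons may ignore case and surrounding whitespace, that phone numbers are compared by digits only, and that a first-name match means either a shared prefix or a `difflib` similarity ratio above a hypothetical threshold of 0.8. The helper names are made up for illustration.

```python
import re
from difflib import SequenceMatcher

ADDRESS_FIELDS = ("Customer_AddressLine1", "Customer_City",
                  "Customer_State", "Customer_Zip")

def norm(value):
    """Trim and lower-case a field (assumption: case and padding are noise)."""
    return (value or "").strip().lower()

def phone_digits(value):
    """Keep only the digits of a phone number (assumption: formatting is noise)."""
    return re.sub(r"\D", "", value or "")

def full_address(rec):
    """Join the address columns; return '' unless every part is present."""
    parts = [norm(rec.get(field)) for field in ADDRESS_FIELDS]
    return "|".join(parts) if all(parts) else ""

def first_names_similar(a, b, threshold=0.8):
    """Fuzzy match: shared prefix ('Rob'/'Robert') or a high similarity ratio."""
    a, b = norm(a), norm(b)
    if not a or not b:
        return False
    if a.startswith(b) or b.startswith(a):
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold

def is_duplicate(r1, r2):
    """Apply conditions a, b and c to a pair of records."""
    last = norm(r1.get("Customer_LastName"))
    if not last or last != norm(r2.get("Customer_LastName")):
        return False                      # a. last name must match
    if not first_names_similar(r1.get("Customer_FirstName"),
                               r2.get("Customer_FirstName")):
        return False                      # b. first name must be similar
    pairs = (
        (norm(r1.get("Customer_InternetEmail")),
         norm(r2.get("Customer_InternetEmail"))),
        (full_address(r1), full_address(r2)),
        (phone_digits(r1.get("Customer_HomePhone")),
         phone_digits(r2.get("Customer_HomePhone"))),
    )
    # c. at least one non-empty contact value must be identical
    return any(x and x == y for x, y in pairs)
```

The threshold and the prefix rule are the "creative" part and would need tuning against the real data; a nickname table (Bill/William) is another option this sketch does not cover.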

Once the duplicate records are identified, they need to be merged into a single record per group, with the data merged so that the result has the most complete set of attributes possible.
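
Forming the groups could be sketched with a small union-find over record indices. This assumes a pairwise duplicate test is supplied by the caller (the `is_duplicate` parameter here is a placeholder), and it makes one design assumption the requirements do not state: if A matches B and B matches C, all three land in the same group even when A and C do not match directly.

```python
def group_duplicates(records, is_duplicate):
    """Return lists of record indices that belong to the same duplicate group."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and join the groups of matching records.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if is_duplicate(records[i], records[j]):
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with two or more members are duplicates.
    return [members for members in groups.values() if len(members) > 1]
```

The pairwise loop is quadratic in the number of records. Because the last name must match exactly, bucketing records by last name first and comparing only within each bucket would cut the work considerably on a large file.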

Example:
a. If two duplicate records share an email address, but only one has a full mailing address, the resultant merged record should have both the email and the mailing address.
b. If two duplicate records have different values for one of the following, the merged record should use the more recent attribute as identified by the ModifiedOn and/or CreatedOn timestamp values:

First Name
Email Address
Full Mailing Address
Phone Number
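
One way to read the two merge rules is "for each attribute, take the value from the most recently touched record that actually has one". The sketch below follows that reading under several assumptions: records are dictionaries keyed by the CSV column names, timestamps are ISO 8601 strings that `datetime.fromisoformat` can parse (the real file's format is not shown), ModifiedOn is preferred and CreatedOn is the fallback, and the four address columns are taken together from one record so that parts of different addresses are not mixed.

```python
from datetime import datetime

# Attributes merged as units; the address columns travel together.
FIELD_GROUPS = (
    ("Customer_LastName",),
    ("Customer_FirstName",),
    ("Customer_AddressLine1", "Customer_City", "Customer_State", "Customer_Zip"),
    ("Customer_HomePhone",),
    ("Customer_InternetEmail",),
)

def last_touched(rec):
    """ModifiedOn if present, else CreatedOn (assumed ISO 8601 strings)."""
    stamp = rec.get("ModifiedOn") or rec.get("CreatedOn")
    return datetime.fromisoformat(stamp) if stamp else datetime.min

def filled(rec, field):
    """True when the record has a non-blank value for the field."""
    return bool((rec.get(field) or "").strip())

def merge_group(records):
    """Build one merged record from a group of duplicates."""
    newest_first = sorted(records, key=last_touched, reverse=True)
    merged = {}
    for group in FIELD_GROUPS:
        # Prefer the newest record that has every field of the group.
        donor = next((r for r in newest_first
                      if all(filled(r, f) for f in group)), None)
        for field in group:
            # Otherwise fall back to the newest record with this one field.
            source = donor or next(
                (r for r in newest_first if filled(r, field)), {})
            merged[field] = source.get(field, "")
    return merged
```

For example, if the newer record has first name "Jon" and no address while the older one has "John" and a full address, the merged record keeps "Jon" and picks up the older record's address.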

Each resulting de-duplicated "Master" record must be appended to the source dataset and given a unique integer ID (you can seed this however you like). That new ID is then assigned as the ParentID of the duplicate source records it was merged from.
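
The master-record step could look like the sketch below. It assumes the existing ID values are integers (stored as text when read from CSV), seeds new IDs at one past the largest existing ID, and takes the duplicate groups and the merge routine as inputs; `groups` and `merge` are placeholder parameters, not part of the original requirements.

```python
def append_masters(rows, groups, merge):
    """Append one master record per duplicate group and link children to it.

    rows   -- list of dicts, each with an integer-like "ID"
    groups -- list of lists of row indices that are duplicates of each other
    merge  -- callable that turns a list of rows into one merged dict
    """
    # Seed new IDs just above the largest existing one (an arbitrary choice).
    next_id = max((int(r["ID"]) for r in rows), default=0) + 1
    # Copy the rows so the input is not modified; ParentID starts out blank.
    out = [dict(r, ParentID="") for r in rows]
    for members in groups:
        if len(members) < 2:
            continue  # a single record is not a duplicate group
        master = dict(merge([rows[i] for i in members]),
                      ID=next_id, ParentID="")
        for i in members:
            out[i]["ParentID"] = next_id
        out.append(master)
        next_id += 1
    return out
```

The master records themselves keep a blank ParentID, which makes them easy to tell apart from the child records that point at them.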

Save and return the (now larger) dataset as a .csv file

The CSV file with the initial data has the columns below:
ID CreatedOn ModifiedOn Customer_LastName Customer_FirstName Customer_AddressLine1 Customer_City Customer_State Customer_Zip Customer_HomePhone Customer_InternetEmail
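
Reading and writing the file could be sketched with Python's standard `csv` module. This assumes the file is comma-delimited with a header row that uses exactly the column names listed above, and that the output gains one extra ParentID column; the sample row is invented for illustration.

```python
import csv
import io

COLUMNS = [
    "ID", "CreatedOn", "ModifiedOn", "Customer_LastName",
    "Customer_FirstName", "Customer_AddressLine1", "Customer_City",
    "Customer_State", "Customer_Zip", "Customer_HomePhone",
    "Customer_InternetEmail",
]
OUTPUT_COLUMNS = COLUMNS + ["ParentID"]

def read_rows(handle):
    """Read every row as a dict keyed by the header names."""
    return list(csv.DictReader(handle))

def write_rows(handle, rows):
    """Write rows with the extra ParentID column; missing values become ''."""
    writer = csv.DictWriter(handle, fieldnames=OUTPUT_COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)

# Round trip on an in-memory sample (made-up data).
sample = ",".join(COLUMNS) + "\n" + (
    "1,2020-01-01,2020-02-01,Smith,Jon,1 Main St,Springfield,IL,62701,"
    "555-0100,js@example.com\n"
)
rows = read_rows(io.StringIO(sample))
buffer = io.StringIO()
write_rows(buffer, rows)
```

With real files, the handles would come from `open(path, newline="", encoding="utf-8")`, which is how the `csv` module expects files to be opened.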

What I have tried:

> Tried parsing the CSV file that contains the data into a data table and filtering it based on the requirements.
> Tried importing the data into SQL and using ADO.NET to run filtering queries.

Recommended answer

Here's something, about as close to an answer as you'll get, but you'll need to figure out how to use it.

T-SQL: look up how to use DISTINCT.

After you see what it gives you, figure out how to make use of it.
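
To see what DISTINCT returns on data like this, here is a small sketch that uses Python's built-in `sqlite3` module in place of SQL Server (an assumption made so the example runs anywhere; the table name, column names and rows are made up). DISTINCT collapses only rows that are identical in every selected column, so "Jon" and "John" stay separate unless the first name is left out of the SELECT list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (last_name TEXT, first_name TEXT, email TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Smith", "Jon", "js@example.com"),
        ("Smith", "Jon", "js@example.com"),    # exact repeat of the first row
        ("Smith", "John", "js@example.com"),   # same person, spelled differently
    ],
)

# Identical rows collapse, but 'Jon' and 'John' remain two separate rows.
exact = conn.execute(
    "SELECT DISTINCT last_name, first_name, email FROM customers"
).fetchall()

# Leaving first_name out shows which last name / email pairs exist at all.
by_contact = conn.execute(
    "SELECT DISTINCT last_name, email FROM customers"
).fetchall()

conn.close()
```

The second query gives candidate groups by last name and email; the fuzzy first-name comparison and the merge would still have to be handled separately.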

