删除日期最早的重复行 [英] Deleting duplicate row that has earliest date

查看:35
本文介绍了删除日期最早的重复行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个名为 PF_temo 的表,它具有以下结构:

I have a table called PF_temo that has the following structure:

  • 名字
  • 中间名
  • 姓氏
  • 出生
  • 地址
  • 城市
  • 状态
  • 电话
  • 有效期

它有许多相同的行,除了有效日期.例如:

It has many rows that are identical, except for the validity date. For example:

steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201609
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201002
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201706
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,199812

我想运行一个脚本来删除除最后一列 (validitydate) 之外的所有匹配项的所有重复项,只留下表中的下方,这是 的最近有效日期201706:

I'd like to run a script that deletes all duplicates matched on everything but the last column (validitydate) leaving just the below in the table, which is the most recent validitydate of 201706:

steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201706

这就是我所拥有的;虽然它抛出了一个异常:

This is what I have; though it throws an exception:

DELETE 
FROM PF_temp
LEFT OUTER JOIN 
(
    SELECT Min(ValidityDate) as RowId
    , firstname
    , middlename
    , lastname
    , DOB
    , address
    , city
    , state 
    , phone
    FROM PF_temp 
    GROUP BY firstname
    , middlename
    , lastname
    , DOB
    , address
    , city
    , state 
    , phone
    , validitydate 
) as KeepRows 
ON TableName.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL

它不起作用,实际上弹出了这个错误:

It doesn't work and actually pops this error:

Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'OUTER'.
Msg 156, Level 15, State 1, Line 4
Incorrect syntax near the keyword 'as'.

另外,我想根据字母表的最后一个字母分阶段运行它.所以像 WHERE lastname like 'A%' 之类的东西需要在某处添加.

Also, I'd like to run it in stages based on the last letter of the alphabet. So something like WHERE lastname like 'A%' needs to be added somewhere.

推荐答案

试试这个:

delete a
from PF_Temp a
inner join PF_Temp b 
on  b.firstname = a.firstname 
and b.middlename = a.middlename
and b.lastname = a.lastname
and b.DOB = a.DOB
and b.address = a.address
and b.city = a.city
and b.state = a.state
and b.phone = a.phone
and b.validitydate > a.validitydate

SQL Fiddle 中的示例.

以上作品由:

  • 加入所有匹配的字段(有效日期除外),从而在 a 中捕获所有重复的记录.在这个阶段,我们捕获所有记录,因为 a 中的记录将与 b 中的记录匹配.
  • 通过指定b中的validitydate必须大于a中的validitydate,我们都避免了上述记录相同的问题(因为如果它是相同的记录,则有效日期将是相同的),并且如果 a 中的记录是最新的,则还确保没有匹配;因为 b 中将不存在匹配项(即没有具有更长有效期的记录).
  • 然后我们删除a返回的每条记录;即每条记录都有重复的有效日期.
  • joining on all matching fields (except validity date), thus capturing in a all records which have duplicates. At this stage we capture all records, since the record in a would match with itself in b.
  • By specifying that the validitydate in b must be greater than that in a we both avoid the above issue of the record being the same (since if it were the same record, the validity date would be the same), and also ensuring that there's no match if the record in a is the most recent; since there will be no match in b (i.e. no record with a greater validity date).
  • we then delete every record which was returned by a; i.e. every record which has a duplicate with a later validity date.

如果您只想删除那些具有特定姓氏的重复项,请完全按照上面所说的进行操作;即添加行 where a.LastName like 'A%'.

If you want to only delete those duplicates with a specific last name, you do exactly what you said above; i.e. add the line where a.LastName like 'A%'.

更新

您提到某些列可能包含 null.这是上述内容的修订版本,以考虑到 null != null.

You mention that some columns may contain nulls. Here's a revised version of the above to take into account that null != null.

delete a
from PF_Temp a
inner join PF_Temp b 
on  ((b.firstname = a.firstname) or (b.firstname is null and a.firstname is null))
and ((b.middlename = a.middlename) or (b.middlename is null and a.middlename is null))
and ((b.lastname = a.lastname) or (b.lastname is null and a.lastname is null))
and ((b.DOB = a.DOB) or (b.DOB is null and a.DOB is null))
and ((b.address = a.address) or (b.address is null and a.address is null))
and ((b.city = a.city) or (b.city is null and a.city is null))
and ((b.state = a.state) or (b.state is null and a.state is null))
and ((b.phone = a.phone) or (b.phone is null and a.phone is null))
and b.validitydate > a.validitydate

上述的替代方法是 oncoalesce(b.firstname,'') =coalesce(a.firstname)(对所有其他匹配字段重复该模式);尽管这意味着 null 和空白被同等对待,并且不会表现得很好.

An alternative to the above would be on coalesce(b.firstname,'') = coalesce(a.firstname) (repeating that pattern for all other matching fields); though that would mean that nulls and blanks were treated the same, and wouldn't perform quite so well.

替代方法

另一种更宽容空值的方法是使用子查询拉回所有值,使用匹配值对每个集合进行编号,从 1 开始表示最近的有效日期.然后我们删除所有返回的数字大于 1 的行;即任何具有较早有效期的重复项.

A different approach, which is more forgiving of nulls, is to use a subquery to pull back all values, numbering each set with matching values, starting at 1 for the most recent validity date. We then delete all those rows which came back with numbers higher than 1; i.e. any which are duplicates with earlier validity dates.

delete TheDeletables
from 
(
    select *
    , row_number() over (
        partition by 
         firstname 
        , middlename 
        , lastname 
        , DOB 
        , address  
        , city 
        , state 
        , phone 
        order by validitydate desc
    ) rowid
    from PF_Temp
) TheDeletables
where rowid > 1;

演示 SQL Fiddle.

这篇关于删除日期最早的重复行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆